[jira] [Commented] (SOLR-14256) Remove HashDocSet

2020-02-14 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037430#comment-17037430
 ] 

David Smiley commented on SOLR-14256:
-

For better or worse, the code is on this PR for a related issue: 
https://github.com/apache/lucene-solr/pull/1257

> Remove HashDocSet
> -
>
> Key: SOLR-14256
> URL: https://issues.apache.org/jira/browse/SOLR-14256
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Reporter: David Smiley
>Priority: Major
>
> This DocSet is only used in places where we need to convert a SortedIntDocSet 
> into a DocSet that is fast for random access.  Once such a conversion happens, 
> the result is only used to test docs for presence, so a smaller interface would 
> suffice.  DocSet has a fairly large API surface to implement.  Since we only 
> need to test docs, we could use the Bits interface (only two methods) backed by 
> an off-the-shelf primitive long hash set already on our classpath.  Perhaps a 
> new method on DocSet: getBits(), or DocSetUtil.getBits(DocSet).
> In addition to removing complexity in its own right, this improvement is required 
> by SOLR-14185, which wants to produce a DocIdSetIterator slice directly from the 
> DocSet; HashDocSet can't do that without sorting first.
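
As a hedged sketch of the proposed accessor (the method name, the maxDoc parameter, and the FixedBitSet backing are illustrative, not a committed API):

{code}
// Hedged sketch: expose O(1) membership tests for any DocSet via Lucene's
// Bits interface (org.apache.lucene.util.Bits), which has only get(int) and
// length().  A FixedBitSet is one possible backing structure.
static Bits getBits(DocSet docs, int maxDoc) {
  FixedBitSet bits = new FixedBitSet(maxDoc);
  DocIterator it = docs.iterator();
  while (it.hasNext()) {
    bits.set(it.nextDoc());
  }
  return bits;
}
{code}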



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9222) Detect upgrades with non-default formats

2020-02-14 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037429#comment-17037429
 ] 

David Smiley commented on LUCENE-9222:
--

>  will always throw an IndexFormatTooOldException on upgrade, even between 
> minor versions

Yeah, he's saying force this exception _even if there is no actual 
incompatibility_.  Sorry but I really hate this idea; this format hasn't 
changed for _many_ releases.

I think instead we need to remember to update the underlying postingsFormat 
name when a change does occur, _which is already versioned_ (e.g. "FST50" -> 
"FST84").  Maybe we should have a test to help us identify these breaks, 
because I appreciate that it's not always clear when one has happened.

> Detect upgrades with non-default formats
> 
>
> Key: LUCENE-9222
> URL: https://issues.apache.org/jira/browse/LUCENE-9222
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Lucene doesn't give any backward-compatibility guarantees with non-default 
> formats, but doesn't try to detect such misuse either, and a couple of users 
> have fallen into this trap over the years; see e.g. SOLR-14254.
> What about dynamically creating the version number of the index format based 
> on the current Lucene version, so that Lucene would fail with an 
> IndexFormatTooOldException for non-default formats instead of a confusing 
> CorruptIndexException? The change would consist of doing something like this 
> for all our non-default index formats:
> {code}
> diff --git 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java 
> b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> index fcc0d00a593..18b35760aec 100644
> --- 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> +++ 
> b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> @@ -41,6 +41,7 @@ import org.apache.lucene.util.BytesRef;
>  import org.apache.lucene.util.FixedBitSet;
>  import org.apache.lucene.util.IOUtils;
>  import org.apache.lucene.util.IntsRefBuilder;
> +import org.apache.lucene.util.Version;
>  import org.apache.lucene.util.fst.FSTCompiler;
>  import org.apache.lucene.util.fst.FST;
>  import org.apache.lucene.util.fst.Util;
> @@ -123,7 +124,7 @@ import org.apache.lucene.util.fst.Util;
>  public class FSTTermsWriter extends FieldsConsumer {
>static final String TERMS_EXTENSION = "tfp";
>static final String TERMS_CODEC_NAME = "FSTTerms";
> -  public static final int TERMS_VERSION_START = 2;
> +  public static final int TERMS_VERSION_START = (Version.LATEST.major << 16) 
> | (Version.LATEST.minor << 8) | Version.LATEST.bugfix;
>public static final int TERMS_VERSION_CURRENT = TERMS_VERSION_START;
>
>final PostingsWriterBase postingsWriter;
> {code}
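
For reference, the patched constant packs the Lucene version into a single int. A small sketch of the encoding and a purely illustrative decode (not part of the patch):

{code}
// Encoding from the proposed patch: major in the high bits, then minor, then bugfix.
int packed = (Version.LATEST.major << 16) | (Version.LATEST.minor << 8) | Version.LATEST.bugfix;
// e.g. for Lucene 8.5.0 this is (8 << 16) | (5 << 8) | 0 = 0x080500 = 525568.

// Illustrative decode:
int major  = (packed >>> 16) & 0xFF;
int minor  = (packed >>> 8) & 0xFF;
int bugfix = packed & 0xFF;
{code}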



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet

2020-02-14 Thread GitBox
dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet
URL: https://github.com/apache/lucene-solr/pull/1257#issuecomment-586556227
 
 
   BTW I'm about to step away on vacation for a week with spotty internet 
access and no computer.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet

2020-02-14 Thread GitBox
dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet
URL: https://github.com/apache/lucene-solr/pull/1257#issuecomment-586556130
 
 
   @mkhludnev Thanks for the approval on the first commit about SOLR-14258 
(DocList shouldn't implement DocSet).  I forged ahead with SOLR-14256 (remove 
HashDocSet) on the same PR because I think one commit doing both is more 
compelling overall.
   
   In short: DocSet is now immutable and always in natural document ID order.  
There are now exactly two implementations (newly enforced).  Guaranteed O(1) 
membership testing is now handled with a getBits method, which is more elegant 
for the callers that need it than the previous idiom of instanceof checks.  The 
back-compat risk of all these changes is pretty low, I think.
   
   I'm less than 100% sure it's okay that the Bits returned by SortedIntDocSet's 
getBits has a length() that will typically be less than the segment's maxDoc.  
Tests pass, so... okay?  If we're not okay with this, we'll need the constructor 
to pass in maxDoc, just as already happens for the FixedBitSet used in BitDocSet.  
WDYT @yonik (see the sketch below)
   
   I could imagine abandoning SortedIntDocSet in favor of only BitDocSet, 
generalized to also support SparseFixedBitSet (thus both BitSet 
implementations).  Or maybe practically speaking it'd need another class; I 
dunno.  Definitely not something I want to explore at this time though.
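
   A minimal sketch of the kind of Bits view under discussion (illustrative only, not the PR's code):

```java
// Illustrative only: a Bits view over a sorted int[] of doc ids.  Note that
// length() ends up as (last doc id + 1), typically smaller than the segment's
// maxDoc -- the point of uncertainty above.
// Uses org.apache.lucene.util.Bits and org.apache.lucene.util.FixedBitSet.
static Bits bitsView(int[] sortedDocs) {
  int numBits = sortedDocs.length == 0 ? 0 : sortedDocs[sortedDocs.length - 1] + 1;
  FixedBitSet bits = new FixedBitSet(numBits);
  for (int doc : sortedDocs) {
    bits.set(doc);
  }
  return bits;
}
```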


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-14 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037404#comment-17037404
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

{quote}It can readily be shown that HNSW performs much better in query time. 
But I was surprised that top 1 in-set recall percent of HNSW is so low. It 
shouldn't be a problem of algorithm itself, but more likely a problem of 
implementation or test code. I will check it this weekend.
{quote}
Thanks [~irvingzhang] for measuring. I noticed I might have made a very basic 
mistake when comparing neighborhood nodes; maybe some inequality signs should 
be flipped :/
 I will do recall performance tests with GloVe and fix the bugs.
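
To illustrate why a flipped comparison would hurt recall (a hedged illustration, not the actual patch code):

{code}
// Illustrative only: with a PriorityQueue of candidate neighbors ordered by
// distance, a single flipped sign turns "closest first" into "farthest first",
// which would explain poor top-1 recall.  (Uses java.util.PriorityQueue.)
record Candidate(int node, float distance) {}

PriorityQueue<Candidate> closestFirst =
    new PriorityQueue<>((a, b) -> Float.compare(a.distance(), b.distance()));
PriorityQueue<Candidate> farthestFirst =   // the suspected bug: flipped sign
    new PriorityQueue<>((a, b) -> Float.compare(b.distance(), a.distance()));
{code}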

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration iwth IndexSearcher, but it does work by some measure using 
> a 
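
As a rough illustration of the vector encoding described above (a hedged sketch under assumptions; the field name, byte order, and exact wrapper are not the POC branch's code):

{code}
// Hedged sketch: encode a fixed-dimension float vector as bytes behind a
// BinaryDocValues field.  Field name "vector" and little-endian order are
// illustrative choices.
float[] vector = {0.12f, -0.53f, 0.88f, 0.07f};
ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
for (float v : vector) {
  buf.putFloat(v);
}
Document doc = new Document();
doc.add(new BinaryDocValuesField("vector", new BytesRef(buf.array())));
// The read side would wrap the stored BytesRef back in a FloatBuffer to recover the floats.
{code}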

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/15/20 1:44 AM:
---

Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior when indexing large 
segments. Because of this concern, it could be nice to include benchmarks for 
index time (in addition to QPS). A couple more thoughts on this point:
 * FAISS helps address the concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. As an example, I 
experimented with FAISS's implementation of IVFFlat, and found that I could run 
very few k-means iterations, but still achieve similar performance. Here are 
some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) 
centroids. The cell values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}
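
To make the "cluster as its own term with a postings list" idea concrete, here is a hedged Lucene sketch (the field name, cluster-id encoding, and nprobe handling are illustrative assumptions):

{code}
// Illustrative sketch: each document is indexed with the id of its assigned
// centroid under a "cluster" field; a kNN query probes the nprobe nearest
// centroids by OR-ing the corresponding cluster terms.
int[] nearestCentroids = {17, 42, 108};  // hypothetical nprobe=3 result
BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (int centroid : nearestCentroids) {
  builder.add(new TermQuery(new Term("cluster", Integer.toString(centroid))),
              BooleanClause.Occur.SHOULD);
}
Query candidates = builder.build();
// Matching docs would then be re-ranked by exact distance to the query vector.
{code}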

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 

[GitHub] [lucene-solr] ErickErickson commented on issue #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters

2020-02-14 Thread GitBox
ErickErickson commented on issue #1022: SOLR-13953: Update eviction behavior of 
cache in Solr Prometheus exporter to allow for larger clusters
URL: https://github.com/apache/lucene-solr/pull/1022#issuecomment-586538447
 
 
   Forgot to close this when I fixed the JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] ErickErickson closed pull request #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters

2020-02-14 Thread GitBox
ErickErickson closed pull request #1022: SOLR-13953: Update eviction behavior 
of cache in Solr Prometheus exporter to allow for larger clusters
URL: https://github.com/apache/lucene-solr/pull/1022
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14263) jvm-settings.adoc is way out of date

2020-02-14 Thread Erick Erickson (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-14263:
--
Attachment: SOLR-14263.patch
Status: Open  (was: Open)

Here's an initial whack at changing this page. Feel free to wordsmith; I'm not 
entirely satisfied with it and will look at it some more later.

> jvm-settings.adoc is way out of date
> 
>
> Key: SOLR-14263
> URL: https://issues.apache.org/jira/browse/SOLR-14263
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: SOLR-14263.patch
>
>
> First of all, it talks about a two-gigabyte heap. Second, I thought we usually 
> recommend that -Xmx and -Xms be set to the same value. I'll have a revision up 
> shortly; I'm thinking of some major surgery on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14263) jvm-settings.adoc is way out of date

2020-02-14 Thread Erick Erickson (Jira)
Erick Erickson created SOLR-14263:
-

 Summary: jvm-settings.adoc is way out of date
 Key: SOLR-14263
 URL: https://issues.apache.org/jira/browse/SOLR-14263
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: documentation
Reporter: Erick Erickson
Assignee: Erick Erickson


First of all, it talks about a two-gigabyte heap. Second, I thought we usually 
recommend that -Xmx and -Xms be set to the same value. I'll have a revision up 
shortly; I'm thinking of some major surgery on it.
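
For illustration, the kind of guidance the revised page would likely give (assuming the standard SOLR_JAVA_MEM variable in solr.in.sh; the sizes are placeholders, not a recommendation):

{code}
# Example only: set min and max heap to the same value
SOLR_JAVA_MEM="-Xms8g -Xmx8g"
{code}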



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] athrog commented on issue #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil

2020-02-14 Thread GitBox
athrog commented on issue #1261: SOLR-14260: Make SchemaRegistryProvider 
pluggable in HttpClientUtil
URL: https://github.com/apache/lucene-solr/pull/1261#issuecomment-586520791
 
 
   FYI @dsmiley 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] athrog opened a new pull request #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil

2020-02-14 Thread GitBox
athrog opened a new pull request #1261: SOLR-14260: Make SchemaRegistryProvider 
pluggable in HttpClientUtil
URL: https://github.com/apache/lucene-solr/pull/1261
 
 
   
   
   
   # Description
   HttpClientUtil.java defines and uses an abstract SchemaRegistryProvider for 
mapping a protocol to an Apache ConnectionSocketFactory. There is only one 
implementation of this abstract class (outside of test cases), and it is 
currently not overridable at runtime.
   
   # Solution
   Adds the ability to override the schema registry provider at runtime, using 
the class name supplied via the "solr.schema.registry.provider" system property, 
similar to how this class already allows choosing the HttpClientBuilderFactory 
at runtime. A minimal sketch of the intended behavior is shown below.
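
   The lookup and instantiation mechanics here are illustrative assumptions (including the default class name), not necessarily the patch's exact code; exception handling is omitted:

```java
// Illustrative: pick the SchemaRegistryProvider implementation named by the
// system property, falling back to the built-in default otherwise.
String className = System.getProperty("solr.schema.registry.provider");
SchemaRegistryProvider provider;
if (className == null) {
  provider = new DefaultSchemaRegistryProvider();  // hypothetical default name
} else {
  provider = (SchemaRegistryProvider) Class.forName(className)
      .getDeclaredConstructor().newInstance();
}
```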
   
   # Tests
   We have been using this patch in our internal fork of Solr. We have verified 
that Solr communication is encrypted, and since this patch enables that 
encryption by setting a custom SSLContext for HTTP clients, we know it is 
working as expected.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default

2020-02-14 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter resolved SOLR-13794.
---
Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed

> Delete solr/core/src/test-files/solr/configsets/_default
> 
>
> Key: SOLR-13794
> URL: https://issues.apache.org/jira/browse/SOLR-13794
> Project: Solr
>  Issue Type: Test
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-13794.patch, SOLR-13794.patch, 
> SOLR-13794_code_only.patch, SOLR-13794_code_only.patch
>
>
> For as long as we've had a {{_default}} configset in solr, we've also had a 
> copy of that default in {{core/src/test-files/}} - as well as a unit test 
> that confirms they are identical.
> It's never really been clear to me *why* we have this duplication, instead of 
> just having the test-framework take the necessary steps to ensure that 
> {{server/solr/configsets/_default}} is properly used when running tests.
> I'd like to propose we eliminate the duplication since it only ever seems to 
> cause problems (notably spurious test failures when people modify the 
> {{_default}} configset w/o remembering that they need to make identical edits 
> to the {{test-files}} clone) and instead have {{SolrTestCase}} set the 
> (already existing & supported) {{solr.default.confdir}} system property to 
> point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:04 PM:


Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt\(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 

> Introduce IVFFlat to Lucene for ANN similarity search
> 

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:03 PM:


Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require substantially more 
distance computations than HNSW. A nice property of the approach is that it's 
based on a classic algorithm, k-means – it is easy to understand, and has few 
tuning parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
[https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset].
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
 {{random centroids  0.578      0.68       0.902       0.961}}
 {{k-means, 1 iter   0.729      0.821      0.961       0.987}}
 {{k-means, 2 iters  0.775      0.854      0.968       0.989}}
 {{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 


was (Author: jtibshirani):
Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require more distance 
computations than HNSW. A nice property of the approach is that it's based on a 
classic algorithm, k-means -- it is easy to understand, and has few tuning 
parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset.
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
{{random centroids  0.578      0.68       0.902       0.961}}
{{k-means, 1 iter   0.729      0.821      0.961       0.987}}
{{k-means, 2 iters  0.775      0.854      0.968       0.989}}
{{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 

> Introduce IVFFlat to Lucene for ANN similarity search
> 

[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-02-14 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325
 ] 

Julie Tibshirani commented on LUCENE-9136:
--

Hello [~irvingzhang], to me this looks like a really interesting direction! We 
also found in our research that k-means clustering (IVFFlat) could achieve high 
recall with a relatively low number of distance computations. It performs well 
compared to KD-trees and LSH, although it tends to require more distance 
computations than HNSW. A nice property of the approach is that it's based on a 
classic algorithm, k-means -- it is easy to understand, and has few tuning 
parameters.

I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms.

A major concern about clustering-based approaches is the high indexing cost. 
K-means is a heavy operation in itself. And even if we only use a subsample of 
documents during k-means, we must compare each indexed document to all 
centroids to assign it to the right cluster. With the heuristic of using 
sqrt(n) centroids, this could give poor scaling behavior at index time. A 
couple thoughts on this point:
 * FAISS helps address this concern by using an ANN algorithm to do the cluster 
assignment. In particular, it provides an option to use k-means clustering 
(IVFFlat), but do the cluster assignment using HNSW: 
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset.
 This seemed like a potentially interesting direction.
 * There could also be ways to streamline the k-means step. I experimented with 
FAISS's implementation of IVFFlat, and found that I could run very few k-means 
iterations, but still achieve similar performance. Here are some results on a 
dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell 
values represent recall for a kNN search with k=10:

*{{approach          10 probes  20 probes  100 probes  200 probes}}*
{{random centroids  0.578      0.68       0.902       0.961}}
{{k-means, 1 iter   0.729      0.821      0.961       0.987}}
{{k-means, 2 iters  0.775      0.854      0.968       0.989}}
{{k-means, 20 iters 0.806      0.875      0.972       0.991}}

 

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, making them hard 
> to integrate into Java projects or to use for those who are not familiar with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product-quantization-based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope 

[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param

2020-02-14 Thread GitBox
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property 
toggle for use of dataConfig param
URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586484779
 
 
   Ok, so the added tests work, and precommit passes. This should be good to go.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param

2020-02-14 Thread GitBox
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property 
toggle for use of dataConfig param
URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586480503
 
 
   The license file name wasn't updated when the library name was changed in 
the previous commit.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13020) Additional Hooks in SolrEventListener

2020-02-14 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037294#comment-17037294
 ] 

David Smiley commented on SOLR-13020:
-

SolrEventListener seems to be of little use for those UpdateCommands given that 
you could just add a URP (UpdateRequestProcessor); no?
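
For reference, a hedged sketch of the URP alternative (factory wiring and solrconfig.xml registration omitted; this is not code from the issue):

{code}
// Illustrative UpdateRequestProcessor covering the proposed pre/post add and
// delete hooks; "pre" work happens before delegating down the chain, "post"
// work after.  Uses org.apache.solr.update.processor.UpdateRequestProcessor,
// AddUpdateCommand, DeleteUpdateCommand, and java.io.IOException.
public class HookingProcessor extends UpdateRequestProcessor {
  public HookingProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    // preAddDoc(cmd) equivalent
    super.processAdd(cmd);
    // postAddDoc(cmd) equivalent
  }

  @Override
  public void processDelete(DeleteUpdateCommand cmd) throws IOException {
    // preDelete(cmd) equivalent
    super.processDelete(cmd);
    // postDelete(cmd) equivalent
  }
}
{code}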

> Additional Hooks in SolrEventListener
> -
>
> Key: SOLR-13020
> URL: https://issues.apache.org/jira/browse/SOLR-13020
> Project: Solr
>  Issue Type: Improvement
>  Components: Plugin system
>Reporter: Kevin Jia
>Priority: Minor
>
> Add more hooks in SolrEventListener to allow for greater user customization. 
> Proposed hooks are: 
> public void postCoreConstruct(SolrCore core);
> public void preAddDoc(AddUpdateCommand cmd);
> public void postAddDoc(AddUpdateCommand cmd);
> public void preDelete(DeleteUpdateCommand cmd);
> public void postDelete(DeleteUpdateCommand cmd);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9222) Detect upgrades with non-default formats

2020-02-14 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037263#comment-17037263
 ] 

Cassandra Targett commented on LUCENE-9222:
---

I'm trying to wrap my head around this, but don't really get the idea. Is this 
proposal effectively saying that any index using a non-default codec will 
always throw an IndexFormatTooOldException on upgrade, even between minor 
versions? Or is it more about improving the type of error that is thrown if the 
non-default codec has been changed after an upgrade?

> Detect upgrades with non-default formats
> 
>
> Key: LUCENE-9222
> URL: https://issues.apache.org/jira/browse/LUCENE-9222
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Lucene doesn't give any backward-compatibility guarantees with non-default 
> formats, but doesn't try to detect such misuse either, and a couple of users 
> have fallen into this trap over the years; see e.g. SOLR-14254.
> What about dynamically creating the version number of the index format based 
> on the current Lucene version, so that Lucene would fail with an 
> IndexFormatTooOldException for non-default formats instead of a confusing 
> CorruptIndexException? The change would consist of doing something like this 
> for all our non-default index formats:
> {code}
> diff --git 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java 
> b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> index fcc0d00a593..18b35760aec 100644
> --- 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> +++ 
> b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> @@ -41,6 +41,7 @@ import org.apache.lucene.util.BytesRef;
>  import org.apache.lucene.util.FixedBitSet;
>  import org.apache.lucene.util.IOUtils;
>  import org.apache.lucene.util.IntsRefBuilder;
> +import org.apache.lucene.util.Version;
>  import org.apache.lucene.util.fst.FSTCompiler;
>  import org.apache.lucene.util.fst.FST;
>  import org.apache.lucene.util.fst.Util;
> @@ -123,7 +124,7 @@ import org.apache.lucene.util.fst.Util;
>  public class FSTTermsWriter extends FieldsConsumer {
>static final String TERMS_EXTENSION = "tfp";
>static final String TERMS_CODEC_NAME = "FSTTerms";
> -  public static final int TERMS_VERSION_START = 2;
> +  public static final int TERMS_VERSION_START = (Version.LATEST.major << 16) 
> | (Version.LATEST.minor << 8) | Version.LATEST.bugfix;
>public static final int TERMS_VERSION_CURRENT = TERMS_VERSION_START;
>
>final PostingsWriterBase postingsWriter;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13020) Additional Hooks in SolrEventListener

2020-02-14 Thread Kevin Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037254#comment-17037254
 ] 

Kevin Jia commented on SOLR-13020:
--

Very well said Akshay. (y)

> Additional Hooks in SolrEventListener
> -
>
> Key: SOLR-13020
> URL: https://issues.apache.org/jira/browse/SOLR-13020
> Project: Solr
>  Issue Type: Improvement
>  Components: Plugin system
>Reporter: Kevin Jia
>Priority: Minor
>
> Add more hooks in SolrEventListener to allow for greater user customization. 
> Proposed hooks are: 
> public void postCoreConstruct(SolrCore core);
> public void preAddDoc(AddUpdateCommand cmd);
> public void postAddDoc(AddUpdateCommand cmd);
> public void preDelete(DeleteUpdateCommand cmd);
> public void postDelete(DeleteUpdateCommand cmd);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dweiss commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param

2020-02-14 Thread GitBox
dweiss commented on issue #1260: SOLR-13669: DIH: Add System property toggle 
for use of dataConfig param
URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586438839
 
 
   You need to copy this license (and checksum) from master or 8x?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param

2020-02-14 Thread GitBox
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property 
toggle for use of dataConfig param
URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586434232
 
 
   I can't run `ant precommit`, due to an issue with a missing license in 
`solr/contrib/clustering/lib/simple-xml-safe-2.7.1.jar`.
   
   99% sure the error isn't because of this commit.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037209#comment-17037209
 ] 

ASF subversion and git services commented on SOLR-13794:


Commit ea20c9a001bde55dd58f0e48abbd6aa44dc2c858 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ea20c9a ]

SOLR-13794: Replace redundent test only copy of '_default' configset with 
SolrTestCase logic to correctly set 'solr.default.confdir' system property

This change allows us to remove kludgy test only code from ZkController

(cherry picked from commit f549ee353530fcd48390a314aff9ec1723b47346)


> Delete solr/core/src/test-files/solr/configsets/_default
> 
>
> Key: SOLR-13794
> URL: https://issues.apache.org/jira/browse/SOLR-13794
> Project: Solr
>  Issue Type: Test
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-13794.patch, SOLR-13794.patch, 
> SOLR-13794_code_only.patch, SOLR-13794_code_only.patch
>
>
> For as long as we've had a {{_default}} configset in solr, we've also had a 
> copy of that default in {{core/src/test-files/}} - as well as a unit test 
> that confirms they are identical.
> It's never really been clear to me *why* we have this duplication, instead of 
> just having the test-framework take the necessary steps to ensure that 
> {{server/solr/configsets/_default}} is properly used when running tests.
> I'd like to propose we eliminate the duplication since it only ever seems to 
> cause problems (notably spurious test failures when people modify the 
> {{_default}} configset w/o remembering that they need to make identical edits 
> to the {{test-files}} clone) and instead have {{SolrTestCase}} set the 
> (already existing & supported) {{solr.default.confdir}} system property to 
> point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman opened a new pull request #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param

2020-02-14 Thread GitBox
HoustonPutman opened a new pull request #1260: SOLR-13669: DIH: Add System 
property toggle for use of dataConfig param
URL: https://github.com/apache/lucene-solr/pull/1260
 
 
   (cherry picked from commit 325824cd391c8e71f36f17d687f52344e50e9715)
   
   Addresses [CVE-2019-0193](https://nvd.nist.gov/vuln/detail/CVE-2019-0193) / 
[SOLR-13669 ](https://issues.apache.org/jira/browse/SOLR-13669)
   
   Backport from Solr 8.1.2
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037193#comment-17037193
 ] 

ASF subversion and git services commented on SOLR-13794:


Commit f549ee353530fcd48390a314aff9ec1723b47346 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f549ee3 ]

SOLR-13794: Replace redundent test only copy of '_default' configset with 
SolrTestCase logic to correctly set 'solr.default.confdir' system property

This change allows us to remove kludgy test only code from ZkController


> Delete solr/core/src/test-files/solr/configsets/_default
> 
>
> Key: SOLR-13794
> URL: https://issues.apache.org/jira/browse/SOLR-13794
> Project: Solr
>  Issue Type: Test
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-13794.patch, SOLR-13794.patch, 
> SOLR-13794_code_only.patch, SOLR-13794_code_only.patch
>
>
> For as long as we've had a {{_default}} configset in solr, we've also had a 
> copy of that default in {{core/src/test-files/}} - as well as a unit test 
> that confirms they are identical.
> It's never really been clear to me *why* we have this duplication, instead of 
> just having the test-framework take the necessary steps to ensure that 
> {{server/solr/configsets/_default}} is properly used when running tests.
> I'd like to propose we eliminate the duplication since it only ever seems to 
> cause problems (notably spurious test failures when people modify the 
> {{_default}} configset w/o remembering that they need to make identical edits 
> to the {{test-files}} clone) and instead have {{SolrTestCase}} set the 
> (already existing & supported) {{solr.default.confdir}} system property to 
> point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader

2020-02-14 Thread GitBox
dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader
URL: https://github.com/apache/lucene-solr/pull/1191#issuecomment-586399204
 
 
   FYI tomorrow I'll be on vacation for a week with minimal internet access, so 
I may not respond much for a bit.  Again, I think the issue is ready for review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13951) Avoid replica state updates to state.json

2020-02-14 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037122#comment-17037122
 ] 

David Smiley commented on SOLR-13951:
-

I'm really happy to see that this will reduce the use of the Overseer.  

Might we consider using Apache Curator for the ZK interaction in the work here?  My 
understanding is that this won't be a big-bang refactor; instead it'll be 
gradual.  It's already on the classpath (pulled in by some authentication code, 
surprisingly).
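
To make that concrete, a minimal, hypothetical sketch of what Curator usage could look 
like (the connect string and ZK path are made up for illustration; this is not a 
proposal for specific Solr code):

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorSketch {
  public static void main(String[] args) throws Exception {
    // Curator wraps the raw ZooKeeper client with connection management and retries.
    try (CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3))) {
      client.start();
      // Read a collection's state.json, roughly what SolrZkClient is used for today.
      byte[] state = client.getData().forPath("/collections/mycoll/state.json");
      System.out.println(new String(state, StandardCharsets.UTF_8));
    }
  }
}
{code}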

> Avoid replica state updates to state.json
> --
>
> Key: SOLR-13951
> URL: https://issues.apache.org/jira/browse/SOLR-13951
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> This can dramatically improve the scalability of Solr and minimize load on 
> Overseer
> See this doc for details
> https://docs.google.com/document/d/1FoPVxiVrbfoSpMqZZRGjBy_jrLI26qhWwUO_aQQ0KRQ/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
tflobbe commented on a change in pull request #1247: SOLR-14252 use double 
rather than Double to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379543081
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
 ##
 @@ -93,18 +115,11 @@ public double getMax() {
 if (values.isEmpty()) {
   return 0;
 }
-Double res = null;
+double res = 0;
 
 Review comment:
   Maybe @sigram ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
tflobbe commented on a change in pull request #1247: SOLR-14252 use double 
rather than Double to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379542807
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
 ##
 @@ -51,6 +61,18 @@ public String toString() {
   ", updateCount=" + updateCount +
   '}';
 }
+
+public double toDouble() {
 
 Review comment:
   Sure, but since this method is new let's make it private from the start


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9226) Point2D#relateTriangle bug

2020-02-14 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-9226.
--
Fix Version/s: 8.5
 Assignee: Ignacio Vera
   Resolution: Fixed

master: ebec456602e480101661cfbd0821fbb397f83d8c
branch_8x: f3c81d76b060f3bdce8cb6dc4c3c8b1cb8fe24cf

I didn't add an entry in CHANGES.txt as it is an unreleased feature

> Point2D#relateTriangle bug
> --
>
> Key: LUCENE-9226
> URL: https://issues.apache.org/jira/browse/LUCENE-9226
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current implementation returns CELL_INSIDE_QUERY when a point lies inside 
> the triangle. It should return CELL_CROSSES_QUERY instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase commented on issue #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle

2020-02-14 Thread GitBox
iverase commented on issue #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when 
point inside the triangle
URL: https://github.com/apache/lucene-solr/pull/1259#issuecomment-586353925
 
 
   I pushed it as it is trivial.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase merged pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle

2020-02-14 Thread GitBox
iverase merged pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when 
point inside the triangle
URL: https://github.com/apache/lucene-solr/pull/1259
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)

2020-02-14 Thread Erick Erickson (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson resolved LUCENE-9224.

Fix Version/s: master (9.0)
   Resolution: Fixed

Pushed a fix just to reduce friction...

> (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
> -
>
> Key: LUCENE-9224
> URL: https://issues.apache.org/jira/browse/LUCENE-9224
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Erick Erickson
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9224.patch
>
>
> I'm not sure if this is due to some conflagration of mixing gradle work with 
> ant work, but today I encountered the following failure after running "ant 
> clean precommit" ...
> {noformat}
> rat-sources:
>   [rat] 
>   [rat] *
>   [rat] Summary
>   [rat] ---
>   [rat] Generated at: 2020-02-13T14:46:10-07:00
>   [rat] Notes: 0
>   [rat] Binaries: 1
>   [rat] Archives: 0
>   [rat] Standards: 95
>   [rat] 
>   [rat] Apache Licensed: 75
>   [rat] Generated Documents: 0
>   [rat] 
>   [rat] JavaDocs are generated and so license header is optional
>   [rat] Generated files do not required license headers
>   [rat] 
>   [rat] 1 Unknown Licenses
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Unapproved licenses:
>   [rat] 
>   [rat]   /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Archives:
>   [rat] 
>   [rat] *
>   [rat]   Files with Apache License headers will be marked AL
>   [rat]   Binary files (which do not require AL headers) will be marked B
>   [rat]   Compressed archives will be marked A
>   [rat]   Notices, licenses etc will be marked N
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/build.xml
>   [rat]  !? 
> /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml
> ...
> {noformat}
> RAT seems to be complaining that there is no license header in its own report 
> file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037070#comment-17037070
 ] 

ASF subversion and git services commented on LUCENE-9224:
-

Commit f52676cd82576eb6231114ffc74bf9a653f92953 in lucene-solr's branch 
refs/heads/master from Erick Erickson
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f52676c ]

LUCENE-9224: (ant) RAT report complains about ... solr/webapp rat-report.xml 
(from gradle)


> (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
> -
>
> Key: LUCENE-9224
> URL: https://issues.apache.org/jira/browse/LUCENE-9224
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9224.patch
>
>
> I'm not sure if this is due to some conflagration of mixing gradle work with 
> ant work, but today I encountered the following failure after running "ant 
> clean precommit" ...
> {noformat}
> rat-sources:
>   [rat] 
>   [rat] *
>   [rat] Summary
>   [rat] ---
>   [rat] Generated at: 2020-02-13T14:46:10-07:00
>   [rat] Notes: 0
>   [rat] Binaries: 1
>   [rat] Archives: 0
>   [rat] Standards: 95
>   [rat] 
>   [rat] Apache Licensed: 75
>   [rat] Generated Documents: 0
>   [rat] 
>   [rat] JavaDocs are generated and so license header is optional
>   [rat] Generated files do not required license headers
>   [rat] 
>   [rat] 1 Unknown Licenses
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Unapproved licenses:
>   [rat] 
>   [rat]   /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Archives:
>   [rat] 
>   [rat] *
>   [rat]   Files with Apache License headers will be marked AL
>   [rat]   Binary files (which do not require AL headers) will be marked B
>   [rat]   Compressed archives will be marked A
>   [rat]   Notices, licenses etc will be marked N
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/build.xml
>   [rat]  !? 
> /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml
> ...
> {noformat}
> RAT seems to be complaining that there is no license header in its own report 
> file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8321) Allow composite readers to have more than 2B documents

2020-02-14 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037035#comment-17037035
 ] 

Erick Erickson commented on LUCENE-8321:


Part of the rabbit hole would be the number of segments. TieredMergePolicy (TMP) has a 
default merged-segment size cap of 5G, for instance. We could certainly raise that or 
create a new merge policy for indexes with lots of docs...

On a separate note I've seen instances of terabyte-scale indexes on disk. 
Allowing that to grow by a factor of 8 would be another part of the rabbit hole.

That said, I'm not against the idea at all. I'm pretty sure operational issues 
would pop out, but that's progress...
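
For reference, the knob in question is TieredMergePolicy#setMaxMergedSegmentMB; a small 
sketch follows (the 20G value is purely illustrative, not a recommendation):

{code:java}
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class BigSegmentConfigSketch {
  public static IndexWriterConfig withLargerMergedSegments(IndexWriterConfig iwc) {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // The default cap is roughly 5GB per merged segment; raising it trades
    // fewer, larger segments for bigger merges.
    tmp.setMaxMergedSegmentMB(20 * 1024); // illustrative: ~20GB
    return iwc.setMergePolicy(tmp);
  }
}
{code}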

 

> Allow composite readers to have more than 2B documents
> --
>
> Key: LUCENE-8321
> URL: https://issues.apache.org/jira/browse/LUCENE-8321
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> I would like to start discussing removing the limit of ~2B documents that we 
> have for indices, while still enforcing it at the segment level for practical 
> reasons.
> Postings, stored fields, and all other codec APIs would keep working on 
> integers to represent doc ids. Only top-level doc ids and numbers of 
> documents would need to move to a long. I say "only" because we now mostly 
> consume indices per-segment, but there is still a number of places where we 
> identify documents by their top-level doc ID like {{IndexReader#document}}, 
> top-docs collectors, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase opened a new pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle

2020-02-14 Thread GitBox
iverase opened a new pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY 
when point inside the triangle
URL: https://github.com/apache/lucene-solr/pull/1259
 
 
   Return the right relation when a point lies inside a triangle.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9226) Point2D#relateTriangle bug

2020-02-14 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-9226:


 Summary: Point2D#relateTriangle bug
 Key: LUCENE-9226
 URL: https://issues.apache.org/jira/browse/LUCENE-9226
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ignacio Vera


The current implementation returns CELL_INSIDE_QUERY when a point lies inside the 
triangle. It should return CELL_CROSSES_QUERY instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379473860
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private final int []uncompressedDocStarts;
+private int uncompressedBlockLength = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+private final int docsPerChunk;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize, int docsPerChunk) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  this.docsPerChunk = docsPerChunk;
+  uncompressedDocStarts = new int[docsPerChunk + 1];
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
 
 Review comment:
   I think so.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379463675
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) 
throws IOException {
 entry.numDocsWithField = meta.readInt();
 entry.minLength = meta.readInt();
 entry.maxLength = meta.readInt();
-if (entry.minLength < entry.maxLength) {
+if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && 
entry.numDocsWithField >0)||  entry.minLength < entry.maxLength) {
   entry.addressesOffset = meta.readLong();
+
+  // Old count of uncompressed addresses 
+  long numAddresses = entry.numDocsWithField + 1L;
+  // New count of compressed addresses - the number of compresseed blocks
+  if (version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED) {
+entry.numCompressedChunks = meta.readVInt();
+entry.docsPerChunk = meta.readVInt();
 
 Review comment:
   Ah - ignore my previous comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
markharwood commented on a change in pull request #1234: Add compression for 
Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379463440
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private final int []uncompressedDocStarts;
+private int uncompressedBlockLength = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+private final int docsPerChunk;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize, int docsPerChunk) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  this.docsPerChunk = docsPerChunk;
+  uncompressedDocStarts = new int[docsPerChunk + 1];
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
 
 Review comment:
   I guess that means I should serialize the shift value rather than the absolute 
number of docs per block?
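
   For what it's worth, the relationship between the two is just power-of-two block 
arithmetic; a small illustration (the shift value here is made up, not the actual 
format constant):

```java
public class BlockMathSketch {
  public static void main(String[] args) {
    int blockShift = 5;                            // hypothetical; docsPerBlock derives from it
    int docsPerBlock = 1 << blockShift;            // 32 docs per block
    int doc = 70;
    int blockId = doc >> blockShift;               // 70 / 32 = block 2
    int offsetInBlock = doc & (docsPerBlock - 1);  // 70 % 32 = offset 6
    System.out.println(blockId + " / " + offsetInBlock);
  }
}
```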


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-10306) Document vm.swappiness and swapoff in RefGuide

2020-02-14 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SOLR-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-10306:
---
Summary: Document vm.swappiness and swapoff in RefGuide  (was: Document 
vm.swappiness and mlockall in RefGuide)

> Document vm.swappiness and swapoff in RefGuide
> --
>
> Key: SOLR-10306
> URL: https://issues.apache.org/jira/browse/SOLR-10306
> Project: Solr
>  Issue Type: Task
>  Components: documentation
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I think we should document sane best practice OS level settings in the ref 
> guide, e.g. in 
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
> Such as lowering system swappiness or the ability to use mlockall (like this ES page 
> https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html)
> I also found this github repo https://github.com/LucidWorks/mlockall-agent - 
> it is old; did anyone have good experience with the agent for locking Solr's 
> memory?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy commented on issue #1256: SOLR-10306: Document in Reference Guide how to disable or reduce swapping

2020-02-14 Thread GitBox
janhoy commented on issue #1256: SOLR-10306: Document in Reference Guide how to 
disable or reduce swapping
URL: https://github.com/apache/lucene-solr/pull/1256#issuecomment-586303086
 
 
   Done some changes. There is probably some sub-par language as well, don't be 
afraid to shoot me down :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost

2020-02-14 Thread GitBox
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost
URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-586284506
 
 
   As I was writing the blog I thought that the Lucene changes would 
automatically bring the same feature to Elasticsearch, but then I ended 
up in another analysis-common package in Elasticsearch that is a duplicate 
of the Lucene one (though it still imports the Lucene stuff).
   
   
https://github.com/elastic/elasticsearch/blob/24e1858a70bd255ebc210415acaac1bfb40340d3/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java
   
   Am I correct to say that this feature won't appear automatically in 
Elasticsearch unless we create a DelimitedBoostTokenFilterFactory there as well?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query

2020-02-14 Thread GitBox
iverase commented on a change in pull request #587: LUCENE-8707: Add 
LatLonShape and XYShape distance query
URL: https://github.com/apache/lucene-solr/pull/587#discussion_r379411676
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/geo/Circle2D.java
 ##
 @@ -0,0 +1,463 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.geo;
+
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.util.SloppyMath;
+
+/**
+ * 2D circle implementation containing spatial logic.
+ */
+class Circle2D implements Component2D {
+
+  private final DistanceCalculator calculator;
+
+  private Circle2D(DistanceCalculator calculator) {
+this.calculator = calculator;
+  }
+
+  @Override
+  public double getMinX() {
+return calculator.getMinX();
+  }
+
+  @Override
+  public double getMaxX() {
+return calculator.getMaxX();
+  }
+
+  @Override
+  public double getMinY() {
+return calculator.getMinY();
+  }
+
+  @Override
+  public double getMaxY() {
+return calculator.getMaxY();
+  }
+
+  @Override
+  public boolean contains(double x, double y) {
+return calculator.contains(x, y);
+  }
+
+  @Override
+  public Relation relate(double minX, double maxX, double minY, double maxY) {
+if (calculator.disjoint(minX, maxX, minY, maxY)) {
+  return Relation.CELL_OUTSIDE_QUERY;
+}
+if (calculator.within(minX, maxX, minY, maxY)) {
+  return Relation.CELL_CROSSES_QUERY;
+}
+return calculator.relate(minX, maxX, minY, maxY);
+  }
+
+  @Override
+  public Relation relateTriangle(double minX, double maxX, double minY, double 
maxY,
+ double ax, double ay, double bx, double by, 
double cx, double cy) {
+if (calculator.disjoint(minX, maxX, minY, maxY)) {
+  return Relation.CELL_OUTSIDE_QUERY;
+}
+if (ax == bx && bx == cx && ay == by && by == cy) {
+  // indexed "triangle" is a point: shortcut by checking contains
+  return contains(ax, ay) ? Relation.CELL_INSIDE_QUERY : 
Relation.CELL_OUTSIDE_QUERY;
+} else if (ax == cx && ay == cy) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(ax, ay, bx, by);
+} else if (ax == bx && ay == by) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(bx, by, cx, cy);
+} else if (bx == cx && by == cy) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(cx, cy, ax, ay);
+}
+// indexed "triangle" is a triangle:
+return relateIndexedTriangle(minX, maxX, minY, maxY, ax, ay, bx, by, cx, 
cy);
+  }
+
+  @Override
+  public WithinRelation withinTriangle(double minX, double maxX, double minY, 
double maxY,
+   double ax, double ay, boolean ab, 
double bx, double by, boolean bc, double cx, double cy, boolean ca) {
+// short cut, lines and points cannot contain this type of shape
+if ((ax == bx && ay == by) || (ax == cx && ay == cy) || (bx == cx && by == 
cy)) {
+  return WithinRelation.DISJOINT;
 
 Review comment:
   Strictly speaking yes, but a point or a line would never return CANDIDATE for 
a circle. I guess I am using disjoint because it is the cheaper answer (the shape 
is actually ignored), but I agree this is probably wrong.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036955#comment-17036955
 ] 

ASF subversion and git services commented on LUCENE-9187:
-

Commit 210f2f83f7c637a8c0f96fe563ab8b49fa18dfd9 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=210f2f8 ]

Add back assertions removed by LUCENE-9187. (#1236)

This time they would only apply to TestFastLZ4/TestHighLZ4 and avoid slowing
down all tests.



> remove too-expensive assert from LZ4 HighCompressionHashTable
> -
>
> Key: LUCENE-9187
> URL: https://issues.apache.org/jira/browse/LUCENE-9187
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9187.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is the slowest method in the lucene tests. See LUCENE-9185 for what I 
> mean.
> If you look at it, its checking 64k values every time the assert is called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036954#comment-17036954
 ] 

ASF subversion and git services commented on LUCENE-9187:
-

Commit da33e4aa6f92ee622a9605636bc5fe1c105f1f94 in lucene-solr's branch 
refs/heads/branch_8x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=da33e4a ]

LUCENE-9187: remove too-expensive assert from LZ4 HighCompressionHashTable


> remove too-expensive assert from LZ4 HighCompressionHashTable
> -
>
> Key: LUCENE-9187
> URL: https://issues.apache.org/jira/browse/LUCENE-9187
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9187.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is the slowest method in the lucene tests. See LUCENE-9185 for what I 
> mean.
> If you look at it, its checking 64k values every time the assert is called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andywebb1975 commented on issue #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
andywebb1975 commented on issue #1247: SOLR-14252 use double rather than Double 
to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#issuecomment-586263732
 
 
   Note it's reporter config like the one below that triggers the exceptions - the 
cache metrics include `LocalStatsCache` values:
   
   ```
   
 10
 UPDATE\./update/.*requests
 QUERY\./select.*requests
 CACHE\.searcher.*
   
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double 
rather than Double to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379399436
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
 ##
 @@ -51,6 +61,18 @@ public String toString() {
   ", updateCount=" + updateCount +
   '}';
 }
+
+public double toDouble() {
 
 Review comment:
   I don't think so, no - and I think some/all of the other `public` things 
in these classes could be made less accessible too.  Happy to take a look at 
this, but would like to make sure the metrics aggregation is working correctly 
first.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double 
rather than Double to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379398056
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
 ##
 @@ -51,6 +61,18 @@ public String toString() {
   ", updateCount=" + updateCount +
   '}';
 }
+
+public double toDouble() {
+  if (value instanceof Boolean) {
+return 0;
+  } 
+  if (!(value instanceof Number)) {
+log.debug("not a Number: " + value);
 
 Review comment:
   yes - I'm not sure this logging is useful; I may just remove it instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE

2020-02-14 Thread GitBox
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double 
rather than Double to avoid NPE
URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379397661
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
 ##
 @@ -93,18 +115,11 @@ public double getMax() {
 if (values.isEmpty()) {
   return 0;
 }
-Double res = null;
+double res = 0;
 
 Review comment:
   That's a good point - I don't know. (Who might know?)
   
   It may be better to switch to creating a list of numerical values and then 
finding its min/max/mean/etc. (ideally using standard library functions) - but 
this may be overkill.
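
   As a sketch of the standard-library route (purely illustrative; not wired into 
AggregateMetric's actual value map):

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;

public class SummaryStatsSketch {
  public static void main(String[] args) {
    List<Double> values = Arrays.asList(1.5, 3.0, 2.25);
    // DoubleSummaryStatistics computes min/max/mean in a single pass.
    DoubleSummaryStatistics stats = values.stream()
        .mapToDouble(Double::doubleValue)
        .summaryStatistics();
    System.out.println(stats.getMin() + " " + stats.getMax() + " " + stats.getAverage());
  }
}
```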


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase opened a new pull request #1258: LUCENE-9225: Rectangle should extend LatLonGeometry

2020-02-14 Thread GitBox
iverase opened a new pull request #1258: LUCENE-9225: Rectangle should extend 
LatLonGeometry
URL: https://github.com/apache/lucene-solr/pull/1258
 
 
   Rectangle now extends LatLonGeometry so it can be used as part of a geometry 
collection. We need to be careful with Contains, and we need to split the 
rectangle in two if it crosses the dateline. 
   
   A test is added to check that we get the same results from LatLonBoundingBoxQuery 
and the corresponding geometry query.
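
   A rough illustration of the dateline split described above (a sketch only; the PR 
may organise this differently):

```java
import org.apache.lucene.geo.Rectangle;

public class DatelineSplitSketch {
  // Split a box whose minLon > maxLon (i.e. it crosses the dateline) into two ordinary boxes.
  public static Rectangle[] split(double minLat, double maxLat, double minLon, double maxLon) {
    if (minLon > maxLon) {
      return new Rectangle[] {
          new Rectangle(minLat, maxLat, minLon, 180.0),
          new Rectangle(minLat, maxLat, -180.0, maxLon)
      };
    }
    return new Rectangle[] { new Rectangle(minLat, maxLat, minLon, maxLon) };
  }
}
```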


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9225) Rectangle should extend LatLonGeometry

2020-02-14 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-9225:


 Summary: Rectangle should extend LatLonGeometry
 Key: LUCENE-9225
 URL: https://issues.apache.org/jira/browse/LUCENE-9225
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Ignacio Vera


The Rectangle class is the only geometry class that does not extend LatLonGeometry. 
This is because we have a specialised query for rectangles that works on the 
encoded space (very similar to what LatLonPoint is doing).

It would be nice if Rectangle could implement LatLonGeometry, so that in cases where 
a bounding box is part of a complex geometry, it can fall back to Component2D 
objects. 

The idea is to move the specialised logic in Rectangle2D inside the specialised 
LatLonBoundingBoxQuery and rename the current XYRectangle2D to Rectangle2D.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9217) Add validation for XYGeometries

2020-02-14 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036898#comment-17036898
 ] 

Ignacio Vera edited comment on LUCENE-9217 at 2/14/20 10:44 AM:


fixed in LUCENE-9218


was (Author: ivera):
resolve in LUCENE-9218

> Add validation for XYGeometries
> ---
>
> Key: LUCENE-9217
> URL: https://issues.apache.org/jira/browse/LUCENE-9217
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently when creating XY geometries, there is no proper validation, in 
> particular checks for NaN, INF and -INF values, which should not be allowed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9217) Add validation for XYGeometries

2020-02-14 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-9217.
--
  Assignee: Ignacio Vera
Resolution: Won't Fix

resolve in LUCENE-9218

> Add validation for XYGeometries
> ---
>
> Key: LUCENE-9217
> URL: https://issues.apache.org/jira/browse/LUCENE-9217
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently when creating XY geometries, there is no proper validation, in 
> particular checks for NaN, INF and -INF values, which should not be allowed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9218) XYGeometries should use floats instead of doubles

2020-02-14 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-9218.
--
Fix Version/s: 8.5
 Assignee: Ignacio Vera
   Resolution: Fixed

> XYGeometries should use floats instead of doubles
> -
>
> Key: LUCENE-9218
> URL: https://issues.apache.org/jira/browse/LUCENE-9218
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> XYGeometries (XYPolygon, XYLine, XYRectangle & XYPoint) are a bit 
> counter-intuitive. While most of them are initialised using floats, when 
> returning those values, they are returned as doubles. In addition, XYRectangle 
> seems to work on doubles.
> In this issue it is proposed to harmonise those classes to only work on 
> floats. As these classes were just moved to core and they have not been 
> released, it should be OK to change their interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase closed pull request #1249: LUCENE-9217: Add validation to XYGeometries

2020-02-14 Thread GitBox
iverase closed pull request #1249: LUCENE-9217: Add validation to XYGeometries
URL: https://github.com/apache/lucene-solr/pull/1249
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase commented on issue #1249: LUCENE-9217: Add validation to XYGeometries

2020-02-14 Thread GitBox
iverase commented on issue #1249: LUCENE-9217: Add validation to XYGeometries
URL: https://github.com/apache/lucene-solr/pull/1249#issuecomment-586204933
 
 
   fixed in #1252 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9218) XYGeometries should use floats instead of doubles

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036895#comment-17036895
 ] 

ASF subversion and git services commented on LUCENE-9218:
-

Commit ca3319cdbce14120291b1dfcda94d25abd677276 in lucene-solr's branch 
refs/heads/branch_8x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ca3319c ]

LUCENE-9218: XYGeometries should expose values as floats (#1252)



> XYGeometries should use floats instead of doubles
> -
>
> Key: LUCENE-9218
> URL: https://issues.apache.org/jira/browse/LUCENE-9218
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> XYGeometries (XYPolygon, XYLine, XYRectangle & XYPoint) are a bit 
> counter-intuitive. While most of them are initialised using floats, when 
> returning those values, they are returned as doubles. In addition, XYRectangle 
> seems to work on doubles.
> In this issue it is proposed to harmonise those classes to only work on 
> floats. As these classes were just moved to core and they have not been 
> released, it should be OK to change their interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] iverase merged pull request #1252: LUCENE-9218: XYGeometries should expose values as floats

2020-02-14 Thread GitBox
iverase merged pull request #1252: LUCENE-9218: XYGeometries should expose 
values as floats
URL: https://github.com/apache/lucene-solr/pull/1252
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9218) XYGeometries should use floats instead of doubles

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036893#comment-17036893
 ] 

ASF subversion and git services commented on LUCENE-9218:
-

Commit 4a54ffb553ad1e45da147dd93f63a20bb4564c91 in lucene-solr's branch 
refs/heads/master from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4a54ffb ]

LUCENE-9218: XYGeometries should expose values as floats (#1252)



> XYGeometries should use floats instead of doubles
> -
>
> Key: LUCENE-9218
> URL: https://issues.apache.org/jira/browse/LUCENE-9218
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> XYGeometries (XYPolygon, XYLine, XYRectangle & XYPoint) are a bit 
> counter-intuitive. While most of them are initialised using floats, when 
> returning those values, they are returned as doubles. In addition, XYRectangle 
> seems to work on doubles.
> In this issue it is proposed to harmonise those classes to only work on 
> floats. As these classes were just moved to core and they have not been 
> released, it should be OK to change their interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9203) Make DocValuesIterator public

2020-02-14 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036878#comment-17036878
 ] 

Adrien Grand commented on LUCENE-9203:
--

Yes indeed. I remember another instance of it (which I can't find again) that 
only needed this class because it was trying to stuff everything into a base 
class, when composition instead of inheritance would have made things even 
simpler. I'm not completely opposed to making it public, but I'd like to see a 
compelling use-case for it. You mentioned consuming all doc values in a generic 
manner, but the couple of cases I've seen were better served by using 
nextDoc/advance than advanceExact? We don't give ways to consume values in a 
generic manner either, but I've not seen many asks for it?
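
For readers following along, a small sketch of the two consumption styles being 
contrasted (the field name is made up):

{code:java}
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

public class DocValuesIterationSketch {

  // Forward-only iteration: visit every document in the segment that has a value.
  static long sumWithNextDoc(LeafReader reader) throws IOException {
    NumericDocValues dv = DocValues.getNumeric(reader, "price"); // hypothetical field
    long sum = 0;
    for (int doc = dv.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = dv.nextDoc()) {
      sum += dv.longValue();
    }
    return sum;
  }

  // Targeted lookup: check whether one specific doc has a value, then read it.
  static Long valueForDoc(LeafReader reader, int docId) throws IOException {
    NumericDocValues dv = DocValues.getNumeric(reader, "price");
    return dv.advanceExact(docId) ? dv.longValue() : null;
  }
}
{code}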

> Make DocValuesIterator public
> -
>
> Key: LUCENE-9203
> URL: https://issues.apache.org/jira/browse/LUCENE-9203
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.4
>Reporter: juan camilo rodriguez duran
>Priority: Trivial
>  Labels: docValues
>
> By doing this, we improve extensibility for new formats. Additionally this 
> will improve coherence with the public method already existent in the class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable

2020-02-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036845#comment-17036845
 ] 

ASF subversion and git services commented on LUCENE-9187:
-

Commit 5cbe58f22c71cfe3f6d21b4661914c255ac80e3d in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5cbe58f ]

Add back assertions removed by LUCENE-9187. (#1236)

This time they would only apply to TestFastLZ4/TestHighLZ4 and avoid slowing
down all tests.



> remove too-expensive assert from LZ4 HighCompressionHashTable
> -
>
> Key: LUCENE-9187
> URL: https://issues.apache.org/jira/browse/LUCENE-9187
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9187.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is the slowest method in the lucene tests. See LUCENE-9185 for what I 
> mean.
> If you look at it, its checking 64k values every time the assert is called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz merged pull request #1236: Add back assertions removed by LUCENE-9187.

2020-02-14 Thread GitBox
jpountz merged pull request #1236: Add back assertions removed by LUCENE-9187.
URL: https://github.com/apache/lucene-solr/pull/1236
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts

2020-02-14 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8762.
--
Resolution: Won't Fix

This has been implemented in the new Lucene84PostingsFormat.

> Lucene50PostingsReader should specialize reading docs+freqs with impacts
> 
>
> Key: LUCENE-8762
> URL: https://issues.apache.org/jira/browse/LUCENE-8762
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently if you ask for impacts, we only have one implementation that is 
> able to expose everything: docs, freqs, positions and offsets. In contrast, 
> if you don't need impacts, we have specialization for docs+freqs, 
> docs+freqs+positions and docs+freqs+positions+offsets.
> Maybe we should add specialization for the docs+freqs case with impacts, 
> which should be the most common case, and remove specialization for 
> docs+freqs+positions when impacts are not requested?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320715
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
 ##
 @@ -210,12 +216,27 @@ public IndexSearcher(IndexReader r, Executor executor) {
   public IndexSearcher(IndexReaderContext context, Executor executor) {
 assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel 
for reader" + context.reader();
 reader = context.reader();
-this.executor = executor;
+this.sliceExecutionControlPlane = getSliceExecutionControlPlane(executor);
 this.readerContext = context;
 leafContexts = context.leaves();
 this.leafSlices = executor == null ? null : slices(leafContexts);
   }
 
+  /**
+   * We do this elaborate dance as to have only one constructor with a 
nullable second parameter
+   * See the next constructor for more clarification
+   * Only for testing
+   */
+  IndexSearcher(IndexReaderContext context, Executor executor,
 
 Review comment:
   this doesn't need an executor, does it?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320286
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/search/DefaultExecutionControlPlane.java
 ##
 @@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executor;
+import java.util.concurrent.Future;
+import java.util.concurrent.FutureTask;
+import java.util.concurrent.RejectedExecutionException;
+
+/**
+ * Default implementation of SliceExecutionControlPlane which executes 
FutureTask instances. This is used
+ * by IndexSearcher as default unless overridden by a custom implementation.
+ */
+public class DefaultExecutionControlPlane implements 
SliceExecutionControlPlane, FutureTask> {
+  private final Executor executor;
+
+  public DefaultExecutionControlPlane(Executor executor) {
+assert executor != null;
+this.executor = executor;
+  }
+
+  @Override
+  public List invokeAll(Collection tasks) {
+List futures = new ArrayList();
+int i = 0;
+
+for (FutureTask task : tasks) {
+  boolean shouldExecuteOnCallerThread = false;
+
+  // Execute last task on caller thread
+  if (i == tasks.size() - 1) {
+shouldExecuteOnCallerThread = true;
+  }
+
+  processTask(task, futures, shouldExecuteOnCallerThread);
+
+  i++;
+}
+
+return futures;
+  }
+
+  // Helper method to execute a single task
+  protected void processTask(FutureTask task, List futures,
+ boolean shouldExecuteOnCallerThread) {
+if (task == null) {
+  throw new IllegalArgumentException("Input is null");
+}
+
+if (!shouldExecuteOnCallerThread) {
+  try {
+executor.execute(task);
+  } catch (RejectedExecutionException e) {
+// Execute on caller thread
+shouldExecuteOnCallerThread = true;
+  }
+}
+
+if (shouldExecuteOnCallerThread) {
+  try {
+task.run();
+  } catch (Exception e) {
+throw new RuntimeException(e.getMessage());
+  }
+}
+
+if (!shouldExecuteOnCallerThread) {
+  futures.add(task);
+} else {
+  try {
+futures.add(CompletableFuture.completedFuture(task.get()));
+  } catch (Exception e) {
+throw new RuntimeException(e.getMessage());
 
 Review comment:
   please preserve the stack trace of the cause
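   For illustration, a minimal self-contained sketch of the point (the class and helper names are made up; only the `throw new RuntimeException(e)` part matters):

   ```java
   import java.util.concurrent.FutureTask;

   class StackTracePreservingExample {
     // Hypothetical helper: runs a task and rethrows failures without losing the original stack trace.
     static void runPreservingCause(FutureTask<?> task) {
       try {
         task.run();
       } catch (Exception e) {
         // new RuntimeException(e.getMessage()) would discard the cause and its stack trace;
         // chaining the exception keeps the full trace available to callers and logs.
         throw new RuntimeException(e);
       }
     }
   }
   ```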


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379322585
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/search/SliceExecutionControlPlane.java
 ##
 @@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.util.Collection;
+
+
+/**
+ * Execution control plane which is responsible
+ * for execution of slices based on the current status
+ * of the system and current system load
+ */
+public interface SliceExecutionControlPlane  {
+  /**
+   * Invoke all slices that are allocated for the query
+   */
+  C invokeAll(Collection tasks);
+}
 
 Review comment:
   the generics on this interface look over-engineered?
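   For context, one possible simpler shape the comment may be hinting at (purely a sketch, not the agreed design; the actual signature is up to the PR author):

   ```java
   import java.util.Collection;
   import java.util.List;
   import java.util.concurrent.Future;
   import java.util.concurrent.FutureTask;

   // Hypothetical simplification: fix the task/result types instead of parameterizing both.
   interface SimpleSliceExecutor {
     List<Future<?>> invokeAll(Collection<? extends FutureTask<?>> tasks);
   }
   ```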


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379318977
 
 

 ##
 File path: 
lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java
 ##
 @@ -113,38 +139,10 @@
 import org.junit.ClassRule;
 import org.junit.Rule;
 import org.junit.Test;
+import org.junit.internal.AssumptionViolatedException;
 import org.junit.rules.RuleChain;
 import org.junit.rules.TestRule;
 import org.junit.runner.RunWith;
-import org.junit.internal.AssumptionViolatedException;
-
-import com.carrotsearch.randomizedtesting.JUnit4MethodProvider;
-import com.carrotsearch.randomizedtesting.LifecycleScope;
-import com.carrotsearch.randomizedtesting.MixWithSuiteName;
-import com.carrotsearch.randomizedtesting.RandomizedContext;
-import com.carrotsearch.randomizedtesting.RandomizedRunner;
-import com.carrotsearch.randomizedtesting.RandomizedTest;
-import com.carrotsearch.randomizedtesting.annotations.Listeners;
-import com.carrotsearch.randomizedtesting.annotations.SeedDecorators;
-import com.carrotsearch.randomizedtesting.annotations.TestGroup;
-import com.carrotsearch.randomizedtesting.annotations.TestMethodProviders;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakAction.Action;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakAction;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakFilters;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakGroup.Group;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakGroup;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakLingering;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakScope.Scope;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakScope;
-import 
com.carrotsearch.randomizedtesting.annotations.ThreadLeakZombies.Consequence;
-import com.carrotsearch.randomizedtesting.annotations.ThreadLeakZombies;
-import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
-import com.carrotsearch.randomizedtesting.generators.RandomPicks;
-import com.carrotsearch.randomizedtesting.rules.NoClassHooksShadowingRule;
-import com.carrotsearch.randomizedtesting.rules.NoInstanceHooksOverridesRule;
-import com.carrotsearch.randomizedtesting.rules.StaticFieldsInvariantRule;
-
-import junit.framework.AssertionFailedError;
 
 Review comment:
   let's undo changes to this class?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320355
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/search/DefaultExecutionControlPlane.java
 ##
 @@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executor;
+import java.util.concurrent.Future;
+import java.util.concurrent.FutureTask;
+import java.util.concurrent.RejectedExecutionException;
+
+/**
+ * Default implementation of SliceExecutionControlPlane which executes 
FutureTask instances. This is used
+ * by IndexSearcher as default unless overridden by a custom implementation.
+ */
+public class DefaultExecutionControlPlane implements 
SliceExecutionControlPlane, FutureTask> {
+  private final Executor executor;
+
+  public DefaultExecutionControlPlane(Executor executor) {
+assert executor != null;
+this.executor = executor;
+  }
+
+  @Override
+  public List invokeAll(Collection tasks) {
+List futures = new ArrayList();
+int i = 0;
+
+for (FutureTask task : tasks) {
+  boolean shouldExecuteOnCallerThread = false;
+
+  // Execute last task on caller thread
+  if (i == tasks.size() - 1) {
+shouldExecuteOnCallerThread = true;
+  }
+
+  processTask(task, futures, shouldExecuteOnCallerThread);
+
+  i++;
+}
+
+return futures;
+  }
+
+  // Helper method to execute a single task
+  protected void processTask(FutureTask task, List futures,
+ boolean shouldExecuteOnCallerThread) {
+if (task == null) {
+  throw new IllegalArgumentException("Input is null");
+}
+
+if (!shouldExecuteOnCallerThread) {
+  try {
+executor.execute(task);
+  } catch (RejectedExecutionException e) {
+// Execute on caller thread
+shouldExecuteOnCallerThread = true;
+  }
+}
+
+if (shouldExecuteOnCallerThread) {
+  try {
+task.run();
+  } catch (Exception e) {
+throw new RuntimeException(e.getMessage());
 
 Review comment:
   please preserve the stack trace of the cause


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #587: LUCENE-8707: Add 
LatLonShape and XYShape distance query
URL: https://github.com/apache/lucene-solr/pull/587#discussion_r379317415
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/geo/Circle2D.java
 ##
 @@ -0,0 +1,463 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.geo;
+
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.util.SloppyMath;
+
+/**
+ * 2D circle implementation containing spatial logic.
+ */
+class Circle2D implements Component2D {
+
+  private final DistanceCalculator calculator;
+
+  private Circle2D(DistanceCalculator calculator) {
+this.calculator = calculator;
+  }
+
+  @Override
+  public double getMinX() {
+return calculator.getMinX();
+  }
+
+  @Override
+  public double getMaxX() {
+return calculator.getMaxX();
+  }
+
+  @Override
+  public double getMinY() {
+return calculator.getMinY();
+  }
+
+  @Override
+  public double getMaxY() {
+return calculator.getMaxY();
+  }
+
+  @Override
+  public boolean contains(double x, double y) {
+return calculator.contains(x, y);
+  }
+
+  @Override
+  public Relation relate(double minX, double maxX, double minY, double maxY) {
+if (calculator.disjoint(minX, maxX, minY, maxY)) {
+  return Relation.CELL_OUTSIDE_QUERY;
+}
+if (calculator.within(minX, maxX, minY, maxY)) {
+  return Relation.CELL_CROSSES_QUERY;
+}
+return calculator.relate(minX, maxX, minY, maxY);
+  }
+
+  @Override
+  public Relation relateTriangle(double minX, double maxX, double minY, double 
maxY,
+ double ax, double ay, double bx, double by, 
double cx, double cy) {
+if (calculator.disjoint(minX, maxX, minY, maxY)) {
+  return Relation.CELL_OUTSIDE_QUERY;
+}
+if (ax == bx && bx == cx && ay == by && by == cy) {
+  // indexed "triangle" is a point: shortcut by checking contains
+  return contains(ax, ay) ? Relation.CELL_INSIDE_QUERY : 
Relation.CELL_OUTSIDE_QUERY;
+} else if (ax == cx && ay == cy) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(ax, ay, bx, by);
+} else if (ax == bx && ay == by) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(bx, by, cx, cy);
+} else if (bx == cx && by == cy) {
+  // indexed "triangle" is a line segment: shortcut by calling appropriate 
method
+  return relateIndexedLineSegment(cx, cy, ax, ay);
+}
+// indexed "triangle" is a triangle:
+return relateIndexedTriangle(minX, maxX, minY, maxY, ax, ay, bx, by, cx, 
cy);
+  }
+
+  @Override
+  public WithinRelation withinTriangle(double minX, double maxX, double minY, 
double maxY,
+   double ax, double ay, boolean ab, 
double bx, double by, boolean bc, double cx, double cy, boolean ca) {
+// short cut, lines and points cannot contain this type of shape
+if ((ax == bx && ay == by) || (ax == cx && ay == cy) || (bx == cx && by == 
cy)) {
+  return WithinRelation.DISJOINT;
 
 Review comment:
   couldn't it be NOTWITHIN in some cases?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries 
should expose values as floats
URL: https://github.com/apache/lucene-solr/pull/1252#discussion_r379314539
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/geo/XYEncodingUtils.java
 ##
 @@ -68,7 +68,15 @@ public static double decode(int encoded) {
* @param offset offset into {@code src} to decode from.
* @return decoded value.
*/
-  public static double decode(byte[] src, int offset) {
+  public static float decode(byte[] src, int offset) {
 return decode(NumericUtils.sortableBytesToInt(src, offset));
   }
+
+  static double[] floatToDouble(float[] f) {
 
 Review comment:
   nit: maybe add `Array` to the method name


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries 
should expose values as floats
URL: https://github.com/apache/lucene-solr/pull/1252#discussion_r379314365
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/geo/XYEncodingUtils.java
 ##
 @@ -33,11 +33,12 @@
   private XYEncodingUtils() {
   }
 
-  /** validates value is within +/-{@link Float#MAX_VALUE} coordinate bounds */
-  public static void checkVal(double x) {
-if (Double.isNaN(x) || x < MIN_VAL_INCL || x > MAX_VAL_INCL) {
+  /** validates value is a number and finite */
+  static float checkVal(float x) {
+if (Float.isNaN(x) || Float.isInfinite(x)) {
 
 Review comment:
   This could be a single test.
   ```suggestion
   if (Float.isFinite(x) == false) {
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-14 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036806#comment-17036806
 ] 

Adrien Grand commented on LUCENE-9211:
--

I had a quick look at Juan's commit; there are things I like and things I have 
questions about. Since this PR is ready, or almost ready, I'd suggest merging 
this one first.

[~juan.duran] I saw that your commit tried to modify the current 
Lucene80DocValuesFormat. I'm a bit nervous about it because it makes it hard to 
spot any potential subtle difference in the on-disk format that would cause 
bugs, so I'd suggest creating a new Lucene85DocValuesFormat instead, even if it 
has the same ideas or even the same on-disk format as the current 
Lucene80DocValuesFormat?

> Adding compression to BinaryDocValues storage
> -
>
> Key: LUCENE-9211
> URL: https://issues.apache.org/jira/browse/LUCENE-9211
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
>  Labels: pull-request-available
>
> While SortedSetDocValues can be used today to store identical values in a 
> compact form this is not effective for data with many unique values.
> The proposal is that BinaryDocValues should be stored in LZ4 compressed 
> blocks which can dramatically reduce disk storage costs in many cases. The 
> proposal is blocks of a number of documents are stored as a single compressed 
> blob along with metadata that records offsets where the original document 
> values can be found in the uncompressed content.
> There's a trade-off here between efficient compression (more docs-per-block = 
> better compression) and fast retrieval times (fewer docs-per-block = faster 
> read access for single values). A fixed block size of 32 docs seems like it 
> would be a reasonable compromise for most scenarios.
> A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379305128
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException {
 };
   }
 }
+  }  
+  
+  // Decompresses blocks of binary values to retrieve content
+  class BinaryDecoder {
+
+private final LongValues addresses;
+private final IndexInput compressedData;
+// Cache of last uncompressed block 
+private long lastBlockId = -1;
+private final int []uncompressedDocStarts;
+private int uncompressedBlockLength = 0;
+private final byte[] uncompressedBlock;
+private final BytesRef uncompressedBytesRef;
+private final int docsPerChunk;
+
+public BinaryDecoder(LongValues addresses, IndexInput compressedData, int 
biggestUncompressedBlockSize, int docsPerChunk) {
+  super();
+  this.addresses = addresses;
+  this.compressedData = compressedData;
+  // pre-allocate a byte array large enough for the biggest uncompressed 
block needed.
+  this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+  uncompressedBytesRef = new BytesRef(uncompressedBlock);
+  this.docsPerChunk = docsPerChunk;
+  uncompressedDocStarts = new int[docsPerChunk + 1];
+  
+}
+
+BytesRef decode(int docNumber) throws IOException {
+  int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; 
 
 Review comment:
   let's use the shift from the BinaryEntry instead of the constant?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379299380
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+data.writeVInt(numDocsInCurrentBlock);
+for (int i = 0; i < numDocsInCurrentBlock; i++) {
+  data.writeVInt(docLengths[i]);
+}
+maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, 
uncompressedBlockLength);
+LZ4.compress(block,  0, uncompressedBlockLength, data, ht);
+numDocsInCurrentBlock = 0;
+uncompressedBlockLength = 0;
+maxPointer = data.getFilePointer();
+tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer);
+  }
+}
+
+void writeMetaData() throws IOException {
+  if 

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379306909
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  boolean success = false;
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+success = true;
+  } finally {
+if (success == false) {
+  IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+}
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+
+// Optimisation - check if all lengths are same
+boolean allLengthsSame = true && numDocsInCurrentBlock >0  ;
+for (int i = 0; i < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; 
i++) {
 
 Review comment:
   in general we do a `break` when setting `allLengthsSame = false` instead of 
adding it to the exit condition of the for statement
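   A self-contained sketch of that shape, written as a small helper for clarity; it also starts the loop at 1 and drops the redundant `numDocsInCurrentBlock > 0` term, as noted in the other comments on this block:

   ```java
   class BlockLengthsExample {
     // Returns true when every entry in docLengths[0..numDocs) has the same value.
     // Bailing out on the first mismatch replaces carrying the flag in the loop condition.
     static boolean allLengthsSame(int[] docLengths, int numDocs) {
       for (int i = 1; i < numDocs; i++) {
         if (docLengths[i] != docLengths[i - 1]) {
           return false;
         }
       }
       return true;
     }
   }
   ```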


This is an automated message from the Apache Git Service.
To respond to the 

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379298761
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  boolean success = false;
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+success = true;
+  } finally {
+if (success == false) {
+  IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+}
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+
+// Optimisation - check if all lengths are same
+boolean allLengthsSame = true && numDocsInCurrentBlock >0  ;
 
 Review comment:
   The second condition is necessarily true given the parent if statement.
   ```suggestion
   boolean allLengthsSame = true;
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this 

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379304074
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -59,18 +60,18 @@
   private long ramBytesUsed;
   private final IndexInput data;
   private final int maxDoc;
+  private int version = -1;
 
   /** expert: instantiates a new reader */
   Lucene80DocValuesProducer(SegmentReadState state, String dataCodec, String 
dataExtension, String metaCodec, String metaExtension) throws IOException {
 String metaName = IndexFileNames.segmentFileName(state.segmentInfo.name, 
state.segmentSuffix, metaExtension);
 this.maxDoc = state.segmentInfo.maxDoc();
 ramBytesUsed = RamUsageEstimator.shallowSizeOfInstance(getClass());
 
-int version = -1;
 
 Review comment:
   maybe keep this variable actually; it would help make `version` final by doing 
`this.version = version;` after the try block?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379294799
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
 
 Review comment:
   can we make `ht`, `tempBinaryOffsets`, `docLengths` final?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379297803
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int [] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte [] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+  } catch (Throwable exception) {
+IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+throw exception;
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
 
 Review comment:
   Have you found what this `something else` is?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379298212
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  boolean success = false;
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+success = true;
+  } finally {
+if (success == false) {
+  IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+}
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+
+// Optimisation - check if all lengths are same
+boolean allLengthsSame = true && numDocsInCurrentBlock >0  ;
+for (int i = 0; i < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; 
i++) {
+  if (i > 0 && docLengths[i] != docLengths[i-1]) {
+allLengthsSame = false;
+  }
+}
+if (allLengthsSame) {
+// Only write one value shifted. Steal a bit to indicate all other 
lengths are the same
+int onlyOneLength = (docLengths[0] 

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379304369
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) 
throws IOException {
 entry.numDocsWithField = meta.readInt();
 entry.minLength = meta.readInt();
 entry.maxLength = meta.readInt();
-if (entry.minLength < entry.maxLength) {
+if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && 
entry.numDocsWithField >0)||  entry.minLength < entry.maxLength) {
 
 Review comment:
   ```suggestion
   if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && 
entry.numDocsWithField > 0) ||  entry.minLength < entry.maxLength) {
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379295432
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
 
 Review comment:
   Depending on the data that will be indexed, it's very hard to know what the 
right initial size is here. Maybe start with an empty array? This will also 
increase confidence that the resizing logic works.
   
   ```suggestion
   byte[] block = BytesRef.EMPTY_BYTES;
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379307116
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
 ##
 @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long 
gcd, ByteBuffersDataOutp
 }
   }
 
-  @Override
-  public void addBinaryField(FieldInfo field, DocValuesProducer 
valuesProducer) throws IOException {
-meta.writeInt(field.number);
-meta.writeByte(Lucene80DocValuesFormat.BINARY);
-
-BinaryDocValues values = valuesProducer.getBinary(field);
-long start = data.getFilePointer();
-meta.writeLong(start); // dataOffset
-int numDocsWithField = 0;
-int minLength = Integer.MAX_VALUE;
-int maxLength = 0;
-for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= values.nextDoc()) {
-  numDocsWithField++;
-  BytesRef v = values.binaryValue();
-  int length = v.length;
-  data.writeBytes(v.bytes, v.offset, v.length);
-  minLength = Math.min(length, minLength);
-  maxLength = Math.max(length, maxLength);
+  class CompressedBinaryBlockWriter implements Closeable {
+FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+int uncompressedBlockLength = 0;
+int maxUncompressedBlockLength = 0;
+int numDocsInCurrentBlock = 0;
+int[] docLengths = new 
int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; 
+byte[] block = new byte [1024 * 16];
+int totalChunks = 0;
+long maxPointer = 0;
+long blockAddressesStart = -1; 
+
+private IndexOutput tempBinaryOffsets;
+
+
+public CompressedBinaryBlockWriter() throws IOException {
+  tempBinaryOffsets = 
state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", 
state.context);
+  boolean success = false;
+  try {
+CodecUtil.writeHeader(tempBinaryOffsets, 
Lucene80DocValuesFormat.META_CODEC + "FilePointers", 
Lucene80DocValuesFormat.VERSION_CURRENT);
+success = true;
+  } finally {
+if (success == false) {
+  IOUtils.closeWhileHandlingException(this); //self-close because 
constructor caller can't 
+}
+  }
 }
-assert numDocsWithField <= maxDoc;
-meta.writeLong(data.getFilePointer() - start); // dataLength
 
-if (numDocsWithField == 0) {
-  meta.writeLong(-2); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else if (numDocsWithField == maxDoc) {
-  meta.writeLong(-1); // docsWithFieldOffset
-  meta.writeLong(0L); // docsWithFieldLength
-  meta.writeShort((short) -1); // jumpTableEntryCount
-  meta.writeByte((byte) -1);   // denseRankPower
-} else {
-  long offset = data.getFilePointer();
-  meta.writeLong(offset); // docsWithFieldOffset
-  values = valuesProducer.getBinary(field);
-  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
-  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
-  meta.writeShort(jumpTableEntryCount);
-  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+void addDoc(int doc, BytesRef v) throws IOException {
+  if (blockAddressesStart < 0) {
+blockAddressesStart = data.getFilePointer();
+  }
+  docLengths[numDocsInCurrentBlock] = v.length;
+  block = ArrayUtil.grow(block, uncompressedBlockLength + v.length);
+  System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, 
v.length);
+  uncompressedBlockLength += v.length;
+  numDocsInCurrentBlock++;
+  if (numDocsInCurrentBlock == 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) {
+flushData();
+  }  
 }
 
-meta.writeInt(numDocsWithField);
-meta.writeInt(minLength);
-meta.writeInt(maxLength);
-if (maxLength > minLength) {
-  start = data.getFilePointer();
-  meta.writeLong(start);
+private void flushData() throws IOException {
+  if (numDocsInCurrentBlock > 0) {
+// Write offset to this block to temporary offsets file
+totalChunks++;
+long thisBlockStartPointer = data.getFilePointer();
+
+// Optimisation - check if all lengths are same
+boolean allLengthsSame = true && numDocsInCurrentBlock >0  ;
+for (int i = 0; i < 
Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; 
i++) {
+  if (i > 0 && docLengths[i] != docLengths[i-1]) {
 
 Review comment:
   if you're only doing it for `i>0`, let's make the loop start at `i=1`?


This is an automated message from the Apache Git Service.
To respond to the 

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields

2020-02-14 Thread GitBox
jpountz commented on a change in pull request #1234: Add compression for Binary 
doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379306326
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
 ##
 @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) 
throws IOException {
 entry.numDocsWithField = meta.readInt();
 entry.minLength = meta.readInt();
 entry.maxLength = meta.readInt();
-if (entry.minLength < entry.maxLength) {
+if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && 
entry.numDocsWithField >0)||  entry.minLength < entry.maxLength) {
   entry.addressesOffset = meta.readLong();
+
+  // Old count of uncompressed addresses 
+  long numAddresses = entry.numDocsWithField + 1L;
+  // New count of compressed addresses - the number of compresseed blocks
+  if (version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED) {
+entry.numCompressedChunks = meta.readVInt();
+entry.docsPerChunk = meta.readVInt();
 
 Review comment:
   maybe this should be the "shift" instead of the number of docs per chunk, so 
that you directly have both the shift (as-is) and the mask `((1 << shift) - 
1)`
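   A tiny runnable illustration of that shift/mask relationship (the numbers and field names are examples only, not the PR's actual fields):

   ```java
   class ShiftMaskExample {
     public static void main(String[] args) {
       int blockShift = 5;                       // what would be written to / read from the metadata
       int docsPerChunk = 1 << blockShift;       // 32 docs per compressed block
       int mask = (1 << blockShift) - 1;         // 31, i.e. docsPerChunk - 1

       int docNumber = 1000;
       int blockId = docNumber >>> blockShift;   // which compressed block holds the doc (31)
       int docInBlock = docNumber & mask;        // offset of the doc inside that block (8)
       System.out.println(blockId + " / " + docInBlock);
     }
   }
   ```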


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)

2020-02-14 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036777#comment-17036777
 ] 

Dawid Weiss commented on LUCENE-9224:
-

Thanks for digging, Erick. Chris - maintaining both ant and gradle on master is 
a bit of juggling. I'd really like to move to removing parts of the ant build 
as soon as possible. The precommit is still not fully equivalent; hope we'll 
get there soon.

> (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
> -
>
> Key: LUCENE-9224
> URL: https://issues.apache.org/jira/browse/LUCENE-9224
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9224.patch
>
>
> I'm not sure if this is due to some conflagration of mixing gradle work with 
> ant work, but today I encountered the following failure after running "ant 
> clean precommit" ...
> {noformat}
> rat-sources:
>   [rat] 
>   [rat] *
>   [rat] Summary
>   [rat] ---
>   [rat] Generated at: 2020-02-13T14:46:10-07:00
>   [rat] Notes: 0
>   [rat] Binaries: 1
>   [rat] Archives: 0
>   [rat] Standards: 95
>   [rat] 
>   [rat] Apache Licensed: 75
>   [rat] Generated Documents: 0
>   [rat] 
>   [rat] JavaDocs are generated and so license header is optional
>   [rat] Generated files do not required license headers
>   [rat] 
>   [rat] 1 Unknown Licenses
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Unapproved licenses:
>   [rat] 
>   [rat]   /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat] 
>   [rat] ***
>   [rat] 
>   [rat] Archives:
>   [rat] 
>   [rat] *
>   [rat]   Files with Apache License headers will be marked AL
>   [rat]   Binary files (which do not require AL headers) will be marked B
>   [rat]   Compressed archives will be marked A
>   [rat]   Notices, licenses etc will be marked N
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/build.xml
>   [rat]  !? 
> /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
>   [rat]   AL/home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml
> ...
> {noformat}
> RAT seems to be complaining that there is no license header in its own report 
> file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org