[jira] [Commented] (SOLR-14256) Remove HashDocSet
[ https://issues.apache.org/jira/browse/SOLR-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037430#comment-17037430 ] David Smiley commented on SOLR-14256:
-

For better or worse, the code is on this PR for a related issue: https://github.com/apache/lucene-solr/pull/1257

> Remove HashDocSet
> -
>
> Key: SOLR-14256
> URL: https://issues.apache.org/jira/browse/SOLR-14256
> Project: Solr
> Issue Type: Task
> Security Level: Public (Default Security Level. Issues are Public)
> Components: search
> Reporter: David Smiley
> Priority: Major
>
> This particular DocSet is only used in places where we need to convert SortedIntDocSet in particular to a DocSet that is fast for random access. Once such a conversion happens, it's only used to test some docs for presence and it could be another interface. DocSet has kind of a large-ish API surface area to implement. Since we only need to test docs, we could use the Bits interface (having only 2 methods) backed by an off-the-shelf primitive long hash set on our classpath. Perhaps a new method on DocSet: getBits() or DocSetUtil.getBits(DocSet).
> In addition to removing complexity unto itself, this improvement is required by SOLR-14185 because it wants to be able to produce a DocIdSetIterator slice directly from the DocSet, but HashDocSet can't do that without sorting first.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
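The Bits-backed membership idea from the issue description can be sketched roughly as follows. This is a hypothetical illustration with stand-in types: the real code would use org.apache.lucene.util.Bits and a primitive long/int hash set from the classpath rather than java.util.HashSet, and the toBits method name is made up here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: expose random-access doc-presence testing through a
// two-method Bits view backed by a hash set, instead of a full DocSet
// implementation. The Bits interface is re-declared so the example is
// self-contained.
public class HashBitsSketch {
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Build a random-access membership view from sorted doc ids
    // (the representation SortedIntDocSet uses internally).
    static Bits toBits(int[] sortedDocs, int maxDoc) {
        Set<Integer> set = new HashSet<>(sortedDocs.length * 2);
        for (int doc : sortedDocs) {
            set.add(doc);
        }
        return new Bits() {
            public boolean get(int index) { return set.contains(index); }
            public int length() { return maxDoc; }
        };
    }

    public static void main(String[] args) {
        Bits bits = toBits(new int[] {2, 5, 9}, 16);
        System.out.println(bits.get(5)); // true
        System.out.println(bits.get(6)); // false
    }
}
```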
[jira] [Commented] (LUCENE-9222) Detect upgrades with non-default formats
[ https://issues.apache.org/jira/browse/LUCENE-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037429#comment-17037429 ] David Smiley commented on LUCENE-9222:
--

> will always throw an IndexFormatTooOldException on upgrade, even between minor versions

Yeah, he's saying force this exception _even if there is no actual incompatibility_. Sorry, but I really hate this idea; this format hasn't changed for _many_ releases. I think instead we need to remember to update the underlying postingsFormat name when a change does occur, _which is already versioned_ (e.g. "FST50" -> "FST84"). Maybe we should have a test to help us identify these breaks, because I totally appreciate that it's not always clear.

> Detect upgrades with non-default formats
>
> Key: LUCENE-9222
> URL: https://issues.apache.org/jira/browse/LUCENE-9222
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Priority: Minor
>
> Lucene doesn't give any backward-compatibility guarantees with non-default formats, but doesn't try to detect such misuse either, and a couple of users fell into this trap over the years; see e.g. SOLR-14254.
> What about dynamically creating the version number of the index format based on the current Lucene version, so that Lucene would fail with an IndexFormatTooOldException with non-default formats instead of a confusing CorruptIndexException.
> The change would consist of doing something like that for all our non-default index formats:
> {code}
> diff --git a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> index fcc0d00a593..18b35760aec 100644
> --- a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> +++ b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> @@ -41,6 +41,7 @@ import org.apache.lucene.util.BytesRef;
>  import org.apache.lucene.util.FixedBitSet;
>  import org.apache.lucene.util.IOUtils;
>  import org.apache.lucene.util.IntsRefBuilder;
> +import org.apache.lucene.util.Version;
>  import org.apache.lucene.util.fst.FSTCompiler;
>  import org.apache.lucene.util.fst.FST;
>  import org.apache.lucene.util.fst.Util;
> @@ -123,7 +124,7 @@ import org.apache.lucene.util.fst.Util;
>  public class FSTTermsWriter extends FieldsConsumer {
>    static final String TERMS_EXTENSION = "tfp";
>    static final String TERMS_CODEC_NAME = "FSTTerms";
> -  public static final int TERMS_VERSION_START = 2;
> +  public static final int TERMS_VERSION_START = (Version.LATEST.major << 16) | (Version.LATEST.minor << 8) | Version.LATEST.bugfix;
>    public static final int TERMS_VERSION_CURRENT = TERMS_VERSION_START;
>
>    final PostingsWriterBase postingsWriter;
> {code}
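To make the packed constant in that diff concrete, here is a sketch of the arithmetic it uses: major in the high bits, then minor and bugfix, one byte each. The class and method names are illustrative only; the real patch inlines the expression. For instance, if Version.LATEST were 8.4.1, the start version would be (8 << 16) | (4 << 8) | 1 = 525313.

```java
// Illustrative sketch of the version-packing scheme in the diff above.
// Names are hypothetical; the real patch writes the expression inline.
public class VersionPacking {
    static int pack(int major, int minor, int bugfix) {
        return (major << 16) | (minor << 8) | bugfix;
    }

    static int[] unpack(int packed) {
        return new int[] { packed >>> 16, (packed >>> 8) & 0xFF, packed & 0xFF };
    }

    public static void main(String[] args) {
        int v = pack(8, 4, 1);                 // as if Version.LATEST were 8.4.1
        System.out.println(v);                 // 525313
        int[] parts = unpack(v);
        System.out.println(parts[0] + "." + parts[1] + "." + parts[2]); // 8.4.1
    }
}
```

Because the packed value grows with every release, an index written by any earlier release carries a strictly smaller format version, which is what would let the codec raise IndexFormatTooOldException rather than CorruptIndexException.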
[GitHub] [lucene-solr] dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet
dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet URL: https://github.com/apache/lucene-solr/pull/1257#issuecomment-586556227 BTW I'm about to step away on vacation for a week with spotty internet access and no computer. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet
dsmiley commented on issue #1257: SOLR-14258: DocList should not extend DocSet
URL: https://github.com/apache/lucene-solr/pull/1257#issuecomment-586556130

@mkhludnev Thanks for the approval on the first commit about SOLR-14258 (DocList shouldn't implement DocSet). I forged ahead with SOLR-14256 (remove HashDocSet) on the same PR because I think in the end, one commit doing both is even more compelling in totality.

In short: DocSet is now immutable and is now always in natural document ID order. There are now exactly two implementations (newly enforced). The guaranteed O(1) presence testing is now handled with a getBits method, and this is more elegant for the callers that need it than the previous code idiom that had to do instanceof checks. The back-compat concern of all the changes here is pretty low, I think.

I'm less than 100% sure it's okay that SortedIntDocSet's getBits has a length() that'll typically be less than the maxDoc of the segment. Tests pass, so... okay? If we're not okay with this, we'll need the constructor to pass in maxDoc, just as already occurs for the FixedBitSet used in BitDocSet. WDYT @yonik?

I could imagine abandoning SortedIntDocSet in favor of only BitDocSet, with modifications to be more general by supporting SparseFixedBitSet (thus both BitSet implementations). Or maybe practically speaking it'd need another class; I dunno. Definitely not something I want to explore at this time though.
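The length()-versus-maxDoc concern above can be made concrete with a small stand-in (hypothetical types and method names; the real classes are SortedIntDocSet and org.apache.lucene.util.Bits):

```java
// Hedged sketch of the back-compat question in the comment: a Bits view built
// from sorted doc ids whose length() is smaller than the segment's maxDoc.
public class SortedBitsLength {
    interface Bits {
        boolean get(int index);
        int length();
    }

    static Bits toBits(int[] sortedDocs) {
        java.util.Set<Integer> set = new java.util.HashSet<>();
        for (int doc : sortedDocs) {
            set.add(doc);
        }
        // length() reports one past the largest doc id present, not maxDoc.
        int length = sortedDocs.length == 0 ? 0 : sortedDocs[sortedDocs.length - 1] + 1;
        return new Bits() {
            public boolean get(int index) { return set.contains(index); }
            public int length() { return length; }
        };
    }

    public static void main(String[] args) {
        Bits bits = toBits(new int[] {2, 5, 9}); // the segment's maxDoc might be 100
        System.out.println(bits.length());       // 10: less than maxDoc
        System.out.println(bits.get(50));        // false: still safe for membership,
        // but a caller that treats length() as maxDoc would be surprised.
    }
}
```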
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037404#comment-17037404 ] Tomoko Uchida commented on LUCENE-9004:
---

{quote}It can readily be shown that HNSW performs much better in query time. But I was surprised that top 1 in-set recall percent of HNSW is so low. It shouldn't be a problem of the algorithm itself, but more likely a problem of the implementation or test code. I will check it this weekend.
{quote}

Thanks [~irvingzhang] for measuring. I noticed I might have made a very basic mistake when comparing neighborhood nodes; maybe some inequality signs should be flipped :/ I will do recall performance tests with GloVe and fix the bugs.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
> Attachments: hnsw_layered_graph.png
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this.
> First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to lookup through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is a very natural fit at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments.
> Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a
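The graph traversal described in the quoted proposal can be sketched, independently of any Lucene data structures, as a greedy descent over an adjacency list. All names here are illustrative; a real HNSW search additionally keeps a beam of candidates and works through multiple layers rather than doing a single greedy walk.

```java
// Toy sketch of greedy nearest-neighbor descent on a proximity graph:
// repeatedly move to any neighbor closer to the query, stopping at a
// local minimum. Illustrative only; not the actual HNSW algorithm.
public class GreedyGraphSearch {
    // Squared Euclidean distance (monotone in true distance, so fine for argmin).
    static float dist(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    static int greedySearch(float[][] vectors, int[][] neighbors, float[] query, int entry) {
        int cur = entry;
        float curDist = dist(vectors[cur], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int nb : neighbors[cur]) {
                float d = dist(vectors[nb], query);
                if (d < curDist) {
                    cur = nb;
                    curDist = d;
                    improved = true;
                }
            }
        }
        return cur; // a local minimum: the approximate nearest neighbor
    }

    public static void main(String[] args) {
        float[][] vectors = { {0f}, {1f}, {2f}, {3f} };
        int[][] neighbors = { {1}, {0, 2}, {1, 3}, {2} }; // a simple chain graph
        System.out.println(greedySearch(vectors, neighbors, new float[] {2.9f}, 0)); // 3
    }
}
```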
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/15/20 1:44 AM:
---

Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters.

I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms.

A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt\(n) centroids, this could give poor scaling behavior when indexing large segments. Because of this concern, it could be nice to include benchmarks for index time (in addition to QPS). A couple more thoughts on this point:
* FAISS helps address the concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction.
* There could also be ways to streamline the k-means step.
As an example, I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell values represent recall for a kNN search with k=10:

{{approach            10 probes   20 probes   100 probes   200 probes}}
{{random centroids    0.578       0.68        0.902        0.961}}
{{k-means, 1 iter     0.729       0.821       0.961        0.987}}
{{k-means, 2 iters    0.775       0.854       0.968        0.989}}
{{k-means, 20 iters   0.806       0.875       0.972        0.991}}
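For reference, the probe-based IVFFlat search those recall numbers measure can be sketched as follows. This is a toy stand-in, not FAISS or Lucene code: the centroids are given rather than learned by k-means, and only the single nearest vector is returned instead of a top-k list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy IVFFlat sketch: assign each vector to its nearest centroid's inverted
// list at index time, then at query time scan only the nprobe closest clusters.
public class IvfFlatSketch {
    static float dist(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    static int nearestCentroid(float[][] centroids, float[] v) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (dist(centroids[c], v) < dist(centroids[best], v)) best = c;
        }
        return best;
    }

    // Returns the id of the nearest vector found among the probed clusters.
    static int search(float[][] centroids, List<List<Integer>> lists,
                      float[][] vectors, float[] q, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Rank clusters by centroid distance to the query, probe the top nprobe.
        Arrays.sort(order, Comparator.comparingDouble(c -> dist(centroids[c], q)));
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int p = 0; p < nprobe && p < order.length; p++) {
            for (int doc : lists.get(order[p])) {
                float d = dist(vectors[doc], q);
                if (d < bestDist) {
                    bestDist = d;
                    best = doc;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] vectors = { {0f}, {1f}, {9f}, {10f} };
        float[][] centroids = { {0.5f}, {9.5f} };   // given, not learned here
        List<List<Integer>> lists = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) lists.add(new ArrayList<>());
        for (int d = 0; d < vectors.length; d++) {
            lists.get(nearestCentroid(centroids, vectors[d])).add(d);
        }
        System.out.println(search(centroids, lists, vectors, new float[] {8.5f}, 1)); // 2
    }
}
```

Raising nprobe trades query time for recall, which is exactly the sweep across the table's columns.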
[GitHub] [lucene-solr] ErickErickson commented on issue #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters
ErickErickson commented on issue #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters
URL: https://github.com/apache/lucene-solr/pull/1022#issuecomment-586538447

Forgot to close this when I fixed the JIRA.
[GitHub] [lucene-solr] ErickErickson closed pull request #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters
ErickErickson closed pull request #1022: SOLR-13953: Update eviction behavior of cache in Solr Prometheus exporter to allow for larger clusters
URL: https://github.com/apache/lucene-solr/pull/1022
[jira] [Updated] (SOLR-14263) jvm-settings.adoc is way out of date
[ https://issues.apache.org/jira/browse/SOLR-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-14263:
--
Attachment: SOLR-14263.patch
Status: Open (was: Open)

Here's an initial whack at changing this page. Feel free to wordsmith; I'm not entirely satisfied with it and will look some more later.

> jvm-settings.adoc is way out of date
>
> Key: SOLR-14263
> URL: https://issues.apache.org/jira/browse/SOLR-14263
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: documentation
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: SOLR-14263.patch
>
> First of all it talks about a two-gigabyte heap. Second, I thought we were usually recommending -Xmx and -Xms be the same. I'll have a revision up shortly; I'm thinking of some major surgery on it.
[jira] [Created] (SOLR-14263) jvm-settings.adoc is way out of date
Erick Erickson created SOLR-14263:
-
Summary: jvm-settings.adoc is way out of date
Key: SOLR-14263
URL: https://issues.apache.org/jira/browse/SOLR-14263
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Components: documentation
Reporter: Erick Erickson
Assignee: Erick Erickson

First of all it talks about a two-gigabyte heap. Second, I thought we were usually recommending -Xmx and -Xms be the same. I'll have a revision up shortly; I'm thinking of some major surgery on it.
[GitHub] [lucene-solr] athrog commented on issue #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil
athrog commented on issue #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil
URL: https://github.com/apache/lucene-solr/pull/1261#issuecomment-586520791

FYI @dsmiley
[GitHub] [lucene-solr] athrog opened a new pull request #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil
athrog opened a new pull request #1261: SOLR-14260: Make SchemaRegistryProvider pluggable in HttpClientUtil
URL: https://github.com/apache/lucene-solr/pull/1261

# Description
HttpClientUtil.java defines and uses an abstract SchemaRegistryProvider for mapping a protocol to an Apache ConnectionSocketFactory. There is only one implementation of this abstract class (outside of test cases). Currently, it is not overridable at runtime.

# Solution
Adds the ability to override the schema registry provider at runtime, using the class name value provided by "solr.schema.registry.provider", similar to how this class allows for choosing the HttpClientBuilderFactory at runtime.

# Tests
We have been using this patch in our internal fork of Solr. We have verified that Solr communication is encrypted, and since this patch helps us enable that encryption by setting a custom SSLContext for HTTP clients, we know this patch is working as expected.

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `ant precommit` and the appropriate test suite.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[jira] [Resolved] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default
[ https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter resolved SOLR-13794.
---
Fix Version/s: 8.5
               master (9.0)
Resolution: Fixed

> Delete solr/core/src/test-files/solr/configsets/_default
>
> Key: SOLR-13794
> URL: https://issues.apache.org/jira/browse/SOLR-13794
> Project: Solr
> Issue Type: Test
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Fix For: master (9.0), 8.5
> Attachments: SOLR-13794.patch, SOLR-13794.patch, SOLR-13794_code_only.patch, SOLR-13794_code_only.patch
>
> For as long as we've had a {{_default}} configset in Solr, we've also had a copy of that default in {{core/src/test-files/}} - as well as a unit test that confirms they are identical.
> It's never really been clear to me *why* we have this duplication, instead of just having the test-framework take the necessary steps to ensure that {{server/solr/configsets/_default}} is properly used when running tests.
> I'd like to propose we eliminate the duplication since it only ever seems to cause problems (notably spurious test failures when people modify the {{_default}} configset w/o remembering that they need to make identical edits to the {{test-files}} clone) and instead have {{SolrTestCase}} set the (already existing & supported) {{solr.default.confdir}} system property to point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:04 PM: Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt\(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. 
I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt\(n) centroids. The cell values represent recall for a kNN search with k=10: *{{approach 10 probes 20 probes 100 probes 200 probes}}* {{random centroids 0.578 0.68 0.902 0.961}} {{k-means, 1 iter 0.729 0.821 0.961 0.987}} {{k-means, 2 iters 0.775 0.854 0.968 0.989}} {{k-means, 20 iters 0.806 0.875 0.972 0.991}} was (Author: jtibshirani): Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. 
In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10: *{{approach 10 probes 20 probes 100 probes 200 probes}}* {{random centroids 0.578 0.68 0.902 0.961}} {{k-means, 1 iter 0.729 0.821 0.961 0.987}} {{k-means, 2 iters 0.775 0.854 0.968 0.989}} {{k-means, 20 iters 0.806 0.875 0.972 0.991}} > Introduce IVFFlat to Lucene for ANN similarity search >
[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325 ] Julie Tibshirani edited comment on LUCENE-9136 at 2/14/20 10:03 PM: Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means – it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple thoughts on this point: * FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: [https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset]. This seemed like a potentially interesting direction. * There could also be ways to streamline the k-means step. 
I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:

*{{approach           10 probes  20 probes  100 probes  200 probes}}*
{{random centroids    0.578      0.68       0.902       0.961}}
{{k-means, 1 iter     0.729      0.821      0.961       0.987}}
{{k-means, 2 iters    0.775      0.854      0.968       0.989}}
{{k-means, 20 iters   0.806      0.875      0.972       0.991}}

was (Author: jtibshirani): Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means -- it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely into the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple of thoughts on this point:
* FAISS helps address this concern by using an ANN algorithm to do the cluster assignment.
In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset. This seemed like a potentially interesting direction.
* There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance. Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:

*{{approach           10 probes  20 probes  100 probes  200 probes}}*
{{random centroids    0.578      0.68       0.902       0.961}}
{{k-means, 1 iter     0.729      0.821      0.961       0.987}}
{{k-means, 2 iters    0.775      0.854      0.968       0.989}}
{{k-means, 20 iters   0.806      0.875      0.972       0.991}}

> Introduce IVFFlat to Lucene for ANN similarity search >
[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037325#comment-17037325 ] Julie Tibshirani commented on LUCENE-9136: -- Hello [~irvingzhang], to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require more distance computations than HNSW. A nice property of the approach is that it's based on a classic algorithm, k-means -- it is easy to understand, and has few tuning parameters. I wonder if this clustering-based approach could fit more closely into the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms. A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, this could give poor scaling behavior at index time. A couple of thoughts on this point:
* FAISS helps address this concern by using an ANN algorithm to do the cluster assignment. In particular, it provides an option to use k-means clustering (IVFFlat), but do the cluster assignment using HNSW: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#how-big-is-the-dataset. This seemed like a potentially interesting direction.
* There could also be ways to streamline the k-means step. I experimented with FAISS's implementation of IVFFlat, and found that I could run very few k-means iterations, but still achieve similar performance.
Here are some results on a dataset of ~1.2 million GloVe word vectors, using sqrt(n) centroids. The cell values represent recall for a kNN search with k=10:

*{{approach           10 probes  20 probes  100 probes  200 probes}}*
{{random centroids    0.578      0.68       0.902       0.961}}
{{k-means, 1 iter     0.729      0.821      0.961       0.987}}
{{k-means, 2 iters    0.775      0.854      0.968       0.989}}
{{k-means, 20 iters   0.806      0.875      0.972       0.991}}

> Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png > > > Representation learning (RL) has been an established discipline in the > machine learning space for decades but it draws tremendous attention lately > with the emergence of deep learning. The central problem of RL is to > determine an optimal representation of the input data. By embedding the data > into a high dimensional vector, the vector retrieval (VR) method is then > applied to search the relevant items. > With the rapid development of RL over the past few years, the technique has > been used extensively in industry from online advertising to computer vision > and speech recognition. There exist many open source implementations of VR > algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various > choices for potential users. However, the aforementioned implementations are > all written in C++, with no plan to support a Java interface, making them hard > to integrate into Java projects for those who are not familiar with C/C++ > [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four > categories, > # Tree-based algorithms, such as KD-tree; > # Hashing methods, such as LSH (Locality-Sensitive Hashing); > # Product quantization based algorithms, such as IVFFlat; > # Graph-based algorithms, such as HNSW, SSG, NSG; > where IVFFlat and HNSW are the most popular ones among all the VR algorithms. > IVFFlat is better for high-precision applications such as face recognition, > while HNSW performs better in general scenarios including recommendation and > personalized advertisement. *The recall ratio of IVFFlat could be gradually > increased by adjusting the query parameter (nprobe), while it's hard for HNSW > to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. > Recently, the implementation of HNSW (Hierarchical Navigable Small World, > LUCENE-9004) for Lucene, has made great progress. The issue draws the attention > of those who are interested in Lucene or hope
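The IVF-style search described in the comments above (assign each vector to its nearest centroid at index time, then probe only the nprobe closest clusters at query time) can be sketched in a few lines. This is a toy illustration with invented names, a fixed set of centroids instead of a real k-means step, and brute-force scans inside each bucket; it is not Lucene or FAISS code:

```java
import java.util.*;

// Toy IVF-Flat index: vectors are bucketed under their nearest centroid;
// a query scans only the nprobe closest buckets instead of every vector.
public class ToyIvfFlat {
    private final float[][] centroids;
    private final List<List<float[]>> buckets;

    public ToyIvfFlat(float[][] centroids) {
        this.centroids = centroids;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) buckets.add(new ArrayList<>());
    }

    // Squared Euclidean distance; the ordering is the same as for true distance.
    static float dist(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
        return s;
    }

    private int nearestCentroid(float[] v) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++)
            if (dist(v, centroids[i]) < dist(v, centroids[best])) best = i;
        return best;
    }

    // Index time: each vector is compared to all centroids (the cost the
    // comment worries about) and stored in its nearest bucket.
    public void add(float[] v) { buckets.get(nearestCentroid(v)).add(v); }

    // Query time: rank buckets by centroid distance, scan only the top nprobe,
    // and return the single nearest stored vector found there.
    public float[] search(float[] q, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(q, centroids[i])));
        float[] best = null;
        for (int p = 0; p < Math.min(nprobe, order.length); p++)
            for (float[] v : buckets.get(order[p]))
                if (best == null || dist(q, v) < dist(q, best)) best = v;
        return best;
    }
}
```

Raising nprobe trades latency for recall, which is exactly the knob behind the "10 probes ... 200 probes" columns in the results table.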
[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586484779 Ok, so the added tests work, and precommit passes. This should be good to go. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586480503 The license file name wasn't updated when the library name was changed in the previous commit.
[jira] [Commented] (SOLR-13020) Additional Hooks in SolrEventListener
[ https://issues.apache.org/jira/browse/SOLR-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037294#comment-17037294 ] David Smiley commented on SOLR-13020: - SolrEventListener seems to be of little use for those UpdateCommands given that you could just add a URP; no? > Additional Hooks in SolrEventListener > - > > Key: SOLR-13020 > URL: https://issues.apache.org/jira/browse/SOLR-13020 > Project: Solr > Issue Type: Improvement > Components: Plugin system >Reporter: Kevin Jia >Priority: Minor > > Add more hooks in SolrEventListener to allow for greater user customization. > Proposed hooks are:
> public void postCoreConstruct(SolrCore core);
> public void preAddDoc(AddUpdateCommand cmd);
> public void postAddDoc(AddUpdateCommand cmd);
> public void preDelete(DeleteUpdateCommand cmd);
> public void postDelete(DeleteUpdateCommand cmd);
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9222) Detect upgrades with non-default formats
[ https://issues.apache.org/jira/browse/LUCENE-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037263#comment-17037263 ] Cassandra Targett commented on LUCENE-9222: --- I'm trying to wrap my head around this, but don't really get the idea. Is this proposal effectively saying that any index using a non-default codec will always throw an IndexFormatTooOldException on upgrade, even between minor versions? Or is it more about improving the type of error that is thrown if the non-default codec has been changed after an upgrade? > Detect upgrades with non-default formats > > > Key: LUCENE-9222 > URL: https://issues.apache.org/jira/browse/LUCENE-9222 > Project: Lucene - Core > Issue Type: Wish >Reporter: Adrien Grand >Priority: Minor > > Lucene doesn't give any backward-compatibility guarantees with non-default > formats, but doesn't try to detect such misuse either, and a couple of users > fell into this trap over the years, see e.g. SOLR-14254. > What about dynamically creating the version number of the index format based > on the current Lucene version, so that Lucene would fail with an > IndexFormatTooOldException with non-default formats instead of a confusing > CorruptIndexException. 
> The change would consist of doing something like that
> for all our non-default index formats:
> {code}
> diff --git a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> index fcc0d00a593..18b35760aec 100644
> --- a/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> +++ b/lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsWriter.java
> @@ -41,6 +41,7 @@ import org.apache.lucene.util.BytesRef;
>  import org.apache.lucene.util.FixedBitSet;
>  import org.apache.lucene.util.IOUtils;
>  import org.apache.lucene.util.IntsRefBuilder;
> +import org.apache.lucene.util.Version;
>  import org.apache.lucene.util.fst.FSTCompiler;
>  import org.apache.lucene.util.fst.FST;
>  import org.apache.lucene.util.fst.Util;
> @@ -123,7 +124,7 @@ import org.apache.lucene.util.fst.Util;
>  public class FSTTermsWriter extends FieldsConsumer {
>    static final String TERMS_EXTENSION = "tfp";
>    static final String TERMS_CODEC_NAME = "FSTTerms";
> -  public static final int TERMS_VERSION_START = 2;
> +  public static final int TERMS_VERSION_START = (Version.LATEST.major << 16) | (Version.LATEST.minor << 8) | Version.LATEST.bugfix;
>    public static final int TERMS_VERSION_CURRENT = TERMS_VERSION_START;
>
>    final PostingsWriterBase postingsWriter;
> {code}
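The packed constant in the diff above folds major/minor/bugfix into a single int so that the format version grows monotonically with the Lucene release, and an index written by a newer non-default format fails an older reader's version check. A quick standalone sketch of that bit layout (class name is hypothetical, not part of Lucene):

```java
// Pack a (major, minor, bugfix) version into one int, mirroring the
// (major << 16) | (minor << 8) | bugfix expression from the diff,
// and unpack it again. Each component must fit in 8 bits.
public class VersionPacking {
    public static int pack(int major, int minor, int bugfix) {
        return (major << 16) | (minor << 8) | bugfix;
    }
    public static int major(int packed)  { return (packed >>> 16) & 0xFF; }
    public static int minor(int packed)  { return (packed >>> 8) & 0xFF; }
    public static int bugfix(int packed) { return packed & 0xFF; }
}
```

Because the packed value orders the same way as the (major, minor, bugfix) tuple, a plain integer comparison suffices for the too-old/too-new check.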
[jira] [Commented] (SOLR-13020) Additional Hooks in SolrEventListener
[ https://issues.apache.org/jira/browse/SOLR-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037254#comment-17037254 ] Kevin Jia commented on SOLR-13020: -- Very well said Akshay. (y) > Additional Hooks in SolrEventListener > - > > Key: SOLR-13020 > URL: https://issues.apache.org/jira/browse/SOLR-13020 > Project: Solr > Issue Type: Improvement > Components: Plugin system >Reporter: Kevin Jia >Priority: Minor > > Add more hooks in SolrEventListener to allow for greater user customization. > Proposed hooks are:
> public void postCoreConstruct(SolrCore core);
> public void preAddDoc(AddUpdateCommand cmd);
> public void postAddDoc(AddUpdateCommand cmd);
> public void preDelete(DeleteUpdateCommand cmd);
> public void postDelete(DeleteUpdateCommand cmd);
[GitHub] [lucene-solr] dweiss commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param
dweiss commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586438839 You need to copy this license (and checksum) from master or 8x?
[GitHub] [lucene-solr] HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param
HoustonPutman commented on issue #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param URL: https://github.com/apache/lucene-solr/pull/1260#issuecomment-586434232 I can't `ant precommit`, due to an issue with a missing license in `solr/contrib/clustering/lib/simple-xml-safe-2.7.1.jar`. 99% sure the error isn't because of this commit.
[jira] [Commented] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default
[ https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037209#comment-17037209 ] ASF subversion and git services commented on SOLR-13794: Commit ea20c9a001bde55dd58f0e48abbd6aa44dc2c858 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ea20c9a ] SOLR-13794: Replace redundent test only copy of '_default' configset with SolrTestCase logic to correctly set 'solr.default.confdir' system property This change allows us to remove kludgy test only code from ZkController (cherry picked from commit f549ee353530fcd48390a314aff9ec1723b47346) > Delete solr/core/src/test-files/solr/configsets/_default > > > Key: SOLR-13794 > URL: https://issues.apache.org/jira/browse/SOLR-13794 > Project: Solr > Issue Type: Test >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13794.patch, SOLR-13794.patch, > SOLR-13794_code_only.patch, SOLR-13794_code_only.patch > > > For as long as we've had a {{_default}} configset in solr, we've also had a > copy of that default in {{core/src/test-files/}} - as well as a unit test > that confirms they are identical. > It's never really been clear to me *why* we have this duplication, instead of > just having the test-framework take the necessary steps to ensure that > {{server/solr/configsets/_default}} is properly used when running tests. 
> I'd like to propose we eliminate the duplication since it only ever seems to > cause problems (notably spurious test failures when people modify the > {{_default}} configset w/o remembering that they need to make identical edits > to the {{test-files}} clone) and instead have {{SolrTestCase}} set the > (already existing & supported) {{solr.default.confdir}} system property to > point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}
[GitHub] [lucene-solr] HoustonPutman opened a new pull request #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param
HoustonPutman opened a new pull request #1260: SOLR-13669: DIH: Add System property toggle for use of dataConfig param URL: https://github.com/apache/lucene-solr/pull/1260 (cherry picked from commit 325824cd391c8e71f36f17d687f52344e50e9715) Addresses [CVE-2019-0193](https://nvd.nist.gov/vuln/detail/CVE-2019-0193) / [SOLR-13669](https://issues.apache.org/jira/browse/SOLR-13669) Backport from Solr 8.1.2

# Checklist
Please review the following and check all that apply:
- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `master` branch.
- [ ] I have run `ant precommit` and the appropriate test suite.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[jira] [Commented] (SOLR-13794) Delete solr/core/src/test-files/solr/configsets/_default
[ https://issues.apache.org/jira/browse/SOLR-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037193#comment-17037193 ] ASF subversion and git services commented on SOLR-13794: Commit f549ee353530fcd48390a314aff9ec1723b47346 in lucene-solr's branch refs/heads/master from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f549ee3 ] SOLR-13794: Replace redundent test only copy of '_default' configset with SolrTestCase logic to correctly set 'solr.default.confdir' system property This change allows us to remove kludgy test only code from ZkController > Delete solr/core/src/test-files/solr/configsets/_default > > > Key: SOLR-13794 > URL: https://issues.apache.org/jira/browse/SOLR-13794 > Project: Solr > Issue Type: Test >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13794.patch, SOLR-13794.patch, > SOLR-13794_code_only.patch, SOLR-13794_code_only.patch > > > For as long as we've had a {{_default}} configset in solr, we've also had a > copy of that default in {{core/src/test-files/}} - as well as a unit test > that confirms they are identical. > It's never really been clear to me *why* we have this duplication, instead of > just having the test-framework take the necessary steps to ensure that > {{server/solr/configsets/_default}} is properly used when running tests. 
> I'd like to propose we eliminate the duplication since it only ever seems to > cause problems (notably spurious test failures when people modify the > {{_default}} configset w/o remembering that they need to make identical edits > to the {{test-files}} clone) and instead have {{SolrTestCase}} set the > (already existing & supported) {{solr.default.confdir}} system property to > point to the (already existing) {{ExternalPaths.DEFAULT_CONFIGSET}}
[GitHub] [lucene-solr] dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader
dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader URL: https://github.com/apache/lucene-solr/pull/1191#issuecomment-586399204 FYI tomorrow I'll be on vacation for a week with minimal internet access, so I may not respond much for a bit. Again, I think the issue is ready for review.
[jira] [Commented] (SOLR-13951) Avoid replica state updates to state.json
[ https://issues.apache.org/jira/browse/SOLR-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037122#comment-17037122 ] David Smiley commented on SOLR-13951: - I'm really happy to see that this will reduce the use of the Overseer. Might we consider using Apache Curator for ZK interaction on work here? It's already on the classpath (some authentication stuff, surprisingly). My understanding is that this won't be a big-bang refactor; instead it'll be gradual. > Avoid replica state updates to state.json > -- > > Key: SOLR-13951 > URL: https://issues.apache.org/jira/browse/SOLR-13951 > Project: Solr > Issue Type: Bug >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > > This can dramatically improve the scalability of Solr and minimize load on > Overseer > See this doc for details > https://docs.google.com/document/d/1FoPVxiVrbfoSpMqZZRGjBy_jrLI26qhWwUO_aQQ0KRQ/edit?usp=sharing
[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE
tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379543081

## File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
@@ -93,18 +115,11 @@ public double getMax() {
     if (values.isEmpty()) {
       return 0;
     }
-    Double res = null;
+    double res = 0;

Review comment: Maybe @sigram ?
[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE
tflobbe commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379542807

## File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java
@@ -51,6 +61,18 @@ public String toString() {
         ", updateCount=" + updateCount +
         '}';
   }
+
+  public double toDouble() {

Review comment: Sure, but since this method is new let's make it private from the start
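For context on the `double` vs `Double` change the reviewers are discussing: a boxed `Double` that starts out `null` can throw a NullPointerException when it is unboxed in a comparison, while a primitive accumulator cannot. A minimal sketch of the safe pattern (hypothetical names, not the actual AggregateMetric code):

```java
import java.util.Collection;

// Illustrates the bug class behind SOLR-14252: tracking a running max with
// a boxed Double initialized to null risks an NPE on unboxing; a primitive
// double accumulator sidesteps it entirely.
public class MaxOfValues {
    public static double maxOrZero(Collection<Double> values) {
        if (values.isEmpty()) {
            return 0;  // mirrors the early return in the quoted diff
        }
        double res = Double.NEGATIVE_INFINITY;  // primitive: no null to trip over
        for (Double v : values) {
            if (v == null) continue;  // skip nulls defensively instead of unboxing them
            if (v > res) res = v;
        }
        return res;
    }
}
```

With the boxed version (`Double res = null; ... if (v > res)`), the first comparison unboxes `res` while it is still `null` and throws; the primitive version never can.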
[jira] [Resolved] (LUCENE-9226) Point2D#relateTriangle bug
[ https://issues.apache.org/jira/browse/LUCENE-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9226. -- Fix Version/s: 8.5 Assignee: Ignacio Vera Resolution: Fixed master: ebec456602e480101661cfbd0821fbb397f83d8c branch_8x: f3c81d76b060f3bdce8cb6dc4c3c8b1cb8fe24cf I didn't add an entry in CHANGES.txt as it is an unreleased feature. > Point2D#relateTriangle bug > -- > > Key: LUCENE-9226 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 8.5 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current implementation returns CELL_INSIDE_QUERY when the point lies inside > the triangle. It should return CELL_CROSSES_QUERY instead.
[GitHub] [lucene-solr] iverase commented on issue #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle
iverase commented on issue #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle URL: https://github.com/apache/lucene-solr/pull/1259#issuecomment-586353925 I pushed it as it is trivial.
[GitHub] [lucene-solr] iverase merged pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle
iverase merged pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle URL: https://github.com/apache/lucene-solr/pull/1259
[jira] [Resolved] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
[ https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson resolved LUCENE-9224. Fix Version/s: master (9.0) Resolution: Fixed Pushed a fix just to reduce friction... > (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle) > - > > Key: LUCENE-9224 > URL: https://issues.apache.org/jira/browse/LUCENE-9224 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9224.patch > > > I'm not sure if this is due to some comflagration of mixing gradle work with > ant work, but today i encountered the following failure after running "ant > clean precommit" ... > {noformat} > rat-sources: > [rat] > [rat] * > [rat] Summary > [rat] --- > [rat] Generated at: 2020-02-13T14:46:10-07:00 > [rat] Notes: 0 > [rat] Binaries: 1 > [rat] Archives: 0 > [rat] Standards: 95 > [rat] > [rat] Apache Licensed: 75 > [rat] Generated Documents: 0 > [rat] > [rat] JavaDocs are generated and so license header is optional > [rat] Generated files do not required license headers > [rat] > [rat] 1 Unknown Licenses > [rat] > [rat] *** > [rat] > [rat] Unapproved licenses: > [rat] > [rat] /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml > [rat] > [rat] *** > [rat] > [rat] Archives: > [rat] > [rat] * > [rat] Files with Apache License headers will be marked AL > [rat] Binary files (which do not require AL headers) will be marked B > [rat] Compressed archives will be marked A > [rat] Notices, licenses etc will be marked N > [rat] AL/home/hossman/lucene/dev/solr/webapp/build.xml > [rat] !? > /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml > [rat] AL/home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml > ... > {noformat} > RAT seems to be comlaining that there is no license header in it's own report > file? 
[jira] [Commented] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
[ https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037070#comment-17037070 ] ASF subversion and git services commented on LUCENE-9224: - Commit f52676cd82576eb6231114ffc74bf9a653f92953 in lucene-solr's branch refs/heads/master from Erick Erickson [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f52676c ] LUCENE-9224: (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle) > (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle) > - > > Key: LUCENE-9224 > URL: https://issues.apache.org/jira/browse/LUCENE-9224 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: LUCENE-9224.patch > > > I'm not sure if this is due to some comflagration of mixing gradle work with > ant work, but today i encountered the following failure after running "ant > clean precommit" ... > {noformat} > rat-sources: > [rat] > [rat] * > [rat] Summary > [rat] --- > [rat] Generated at: 2020-02-13T14:46:10-07:00 > [rat] Notes: 0 > [rat] Binaries: 1 > [rat] Archives: 0 > [rat] Standards: 95 > [rat] > [rat] Apache Licensed: 75 > [rat] Generated Documents: 0 > [rat] > [rat] JavaDocs are generated and so license header is optional > [rat] Generated files do not required license headers > [rat] > [rat] 1 Unknown Licenses > [rat] > [rat] *** > [rat] > [rat] Unapproved licenses: > [rat] > [rat] /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml > [rat] > [rat] *** > [rat] > [rat] Archives: > [rat] > [rat] * > [rat] Files with Apache License headers will be marked AL > [rat] Binary files (which do not require AL headers) will be marked B > [rat] Compressed archives will be marked A > [rat] Notices, licenses etc will be marked N > [rat] AL/home/hossman/lucene/dev/solr/webapp/build.xml > [rat] !? 
> /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml > [rat] AL/home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml > ... > {noformat} > RAT seems to be complaining that there is no license header in its own report > file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8321) Allow composite readers to have more than 2B documents
[ https://issues.apache.org/jira/browse/LUCENE-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037035#comment-17037035 ] Erick Erickson commented on LUCENE-8321: Part of the rabbit hole would be the number of segments. TMP has a default segment size cap of 5G for instance. We could certainly up that or create a new merge policy for indexes with lots of docs... On a separate note I've seen instances of terabyte-scale indexes on disk. Allowing that to grow by a factor of 8 would be another part of the rabbit hole. That said, I'm not against the idea at all. I'm pretty sure operational issues would pop out, but that's progress... > Allow composite readers to have more than 2B documents > -- > > Key: LUCENE-8321 > URL: https://issues.apache.org/jira/browse/LUCENE-8321 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > I would like to start discussing removing the limit of ~2B documents that we > have for indices, while still enforcing it at the segment level for practical > reasons. > Postings, stored fields, and all other codec APIs would keep working on > integers to represent doc ids. Only top-level doc ids and numbers of > documents would need to move to a long. I say "only" because we now mostly > consume indices per-segment, but there is still a number of places where we > identify documents by their top-level doc ID like {{IndexReader#document}}, > top-docs collectors, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
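The change the issue sketches (top-level doc IDs become {{long}} while segment-local IDs stay {{int}}) is essentially base-offset arithmetic over per-segment doc bases. A minimal illustrative sketch, not Lucene code — the names `toGlobal`/`toSegmentLocal` are hypothetical:

```java
// Hypothetical sketch of long top-level doc IDs composed from int
// segment-local doc IDs. docBases[i] is the sum of maxDoc() of all
// segments before i; it is the only quantity that needs to be a long.
public class LongDocIdSketch {

    static long toGlobal(long[] docBases, int segment, int segmentDocId) {
        return docBases[segment] + segmentDocId;
    }

    // Inverse: binary-search the owning segment, then subtract its base.
    static int toSegmentLocal(long[] docBases, long globalDocId) {
        int lo = 0, hi = docBases.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (docBases[mid] <= globalDocId) lo = mid; else hi = mid - 1;
        }
        return (int) (globalDocId - docBases[lo]);
    }

    public static void main(String[] args) {
        // three segments, each near the current ~2B per-segment cap
        long[] bases = {0L, 2_000_000_000L, 4_000_000_000L};
        long g = toGlobal(bases, 2, 7);
        System.out.println(g);                        // 4000000007
        System.out.println(toSegmentLocal(bases, g)); // 7
    }
}
```

This is also why the per-segment codec APIs can keep working on ints: only callers that identify documents by top-level ID (IndexReader#document, top-docs collectors) would see the wider type.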
[GitHub] [lucene-solr] iverase opened a new pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle
iverase opened a new pull request #1259: LUCENE-9226: Return CELL_CROSSES_QUERY when point inside the triangle URL: https://github.com/apache/lucene-solr/pull/1259 Return the right relation when a point lies inside a triangle. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (LUCENE-9226) Point2D#relateTriangle bug
Ignacio Vera created LUCENE-9226: Summary: Point2D#relateTriangle bug Key: LUCENE-9226 URL: https://issues.apache.org/jira/browse/LUCENE-9226 Project: Lucene - Core Issue Type: Bug Reporter: Ignacio Vera The current implementation returns CELL_INSIDE_QUERY when a point lies inside the triangle. It should return CELL_CROSSES_QUERY instead.
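The bug is about relation semantics rather than geometry. The sketch below (illustrative, not the actual Point2D implementation) shows the intended behaviour: for a point query, a non-degenerate indexed triangle that contains the point CROSSES the query; only a degenerate "triangle" that collapses to exactly the query point is INSIDE it.

```java
// Hedged sketch of the fixed semantics; Relation mirrors the Lucene enum
// values named in the issue but this is not the real Point2D code.
public class PointTriangleRelate {
    enum Relation { CELL_INSIDE_QUERY, CELL_CROSSES_QUERY, CELL_OUTSIDE_QUERY }

    // signed-area orientation of (ax,ay)->(bx,by) relative to (px,py)
    static double orient(double ax, double ay, double bx, double by, double px, double py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    static Relation relate(double px, double py,
                           double ax, double ay, double bx, double by, double cx, double cy) {
        if (ax == bx && bx == cx && ay == by && by == cy) {
            // degenerate triangle: a single point -- only case that can be INSIDE
            return (ax == px && ay == py) ? Relation.CELL_INSIDE_QUERY : Relation.CELL_OUTSIDE_QUERY;
        }
        double d1 = orient(ax, ay, bx, by, px, py);
        double d2 = orient(bx, by, cx, cy, px, py);
        double d3 = orient(cx, cy, ax, ay, px, py);
        boolean hasNeg = d1 < 0 || d2 < 0 || d3 < 0;
        boolean hasPos = d1 > 0 || d2 > 0 || d3 > 0;
        // all orientations agree => the point is inside (or on) the triangle,
        // so the triangle crosses the point query rather than being inside it
        return !(hasNeg && hasPos) ? Relation.CELL_CROSSES_QUERY : Relation.CELL_OUTSIDE_QUERY;
    }

    public static void main(String[] args) {
        System.out.println(relate(1, 1, 0, 0, 4, 0, 0, 4)); // CELL_CROSSES_QUERY
        System.out.println(relate(5, 5, 0, 0, 4, 0, 0, 4)); // CELL_OUTSIDE_QUERY
    }
}
```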
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379473860 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private final int []uncompressedDocStarts; +private int uncompressedBlockLength = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; +private final int docsPerChunk; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunk) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + this.docsPerChunk = docsPerChunk; + uncompressedDocStarts = new int[docsPerChunk + 1]; + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; Review comment: I think so. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379463675 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) throws IOException { entry.numDocsWithField = meta.readInt(); entry.minLength = meta.readInt(); entry.maxLength = meta.readInt(); -if (entry.minLength < entry.maxLength) { +if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && entry.numDocsWithField >0)|| entry.minLength < entry.maxLength) { entry.addressesOffset = meta.readLong(); + + // Old count of uncompressed addresses + long numAddresses = entry.numDocsWithField + 1L; + // New count of compressed addresses - the number of compresseed blocks + if (version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED) { +entry.numCompressedChunks = meta.readVInt(); +entry.docsPerChunk = meta.readVInt(); Review comment: Ah - ignore my previous comment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields
markharwood commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379463440 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private final int []uncompressedDocStarts; +private int uncompressedBlockLength = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; +private final int docsPerChunk; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunk) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + this.docsPerChunk = docsPerChunk; + uncompressedDocStarts = new int[docsPerChunk + 1]; + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; Review comment: I guess that means I should serialize the shift value rather than the absolute number of docs per block?
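The review point is that when the chunk size is a power of two, the shift alone determines everything a reader needs. A small sketch (constant names are illustrative, not the actual Lucene80 values):

```java
// Sketch: with docs-per-chunk fixed at 1 << shift, serializing only the
// shift lets a reader derive the chunk size, the block id, and the
// position of a doc within its block by shift/mask arithmetic.
public class BlockShiftSketch {
    static final int BINARY_BLOCK_SHIFT = 5;                 // e.g. 32 docs per chunk
    static final int DOCS_PER_CHUNK = 1 << BINARY_BLOCK_SHIFT;
    static final int BLOCK_MASK = DOCS_PER_CHUNK - 1;

    static int blockId(int docNumber)    { return docNumber >> BINARY_BLOCK_SHIFT; }
    static int docInBlock(int docNumber) { return docNumber & BLOCK_MASK; }

    public static void main(String[] args) {
        System.out.println(blockId(70));    // 2  (70 = 2 * 32 + 6)
        System.out.println(docInBlock(70)); // 6
    }
}
```

Serializing an arbitrary docs-per-chunk count instead would force the decoder to fall back to division and modulo, and would admit non-power-of-two values that the shift in `decode` cannot express.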
[jira] [Updated] (SOLR-10306) Document vm.swappiness and swapoff in RefGuide
[ https://issues.apache.org/jira/browse/SOLR-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-10306: --- Summary: Document vm.swappiness and swapoff in RefGuide (was: Document vm.swappiness and mlockall in RefGuide) > Document vm.swappiness and swapoff in RefGuide > -- > > Key: SOLR-10306 > URL: https://issues.apache.org/jira/browse/SOLR-10306 > Project: Solr > Issue Type: Task > Components: documentation >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > I think we should document sane best practice OS level settings in the ref > guide, e.g. in > https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production > Such as lower system swappiness or ability to use mlockall (like this ES page > https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html) > I also found this github repo https://github.com/LucidWorks/mlockall-agent - > it is old, did anyone have good experience with the agent for locking Solr's > memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] janhoy commented on issue #1256: SOLR-10306: Document in Reference Guide how to disable or reduce swapping
janhoy commented on issue #1256: SOLR-10306: Document in Reference Guide how to disable or reduce swapping URL: https://github.com/apache/lucene-solr/pull/1256#issuecomment-586303086 Done some changes. There is probably some sub-par language as well, don't be afraid to shoot me down :)
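The guidance in the PR boils down to a one-line kernel setting; a sketch of the kind of fragment involved (the value 1 is illustrative; the RefGuide text in the PR is authoritative):

```
# /etc/sysctl.conf (excerpt): tell the kernel to avoid swapping
# application memory; 60 is a common Linux default.
vm.swappiness=1
```

Applied without a reboot via `sudo sysctl -p`; disabling swap entirely is `sudo swapoff -a` (plus removing the swap entries from /etc/fstab so it stays off).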
[GitHub] [lucene-solr] alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost
alessandrobenedetti commented on issue #357: [SOLR-12238] Synonym Queries boost URL: https://github.com/apache/lucene-solr/pull/357#issuecomment-586284506 As I was writing the blog I thought that the Lucene changes would automatically bring the same feature to Elasticsearch, but then I ended up in another analysis-common package in Elasticsearch that is a duplicate of the Lucene one (but it still imports the Lucene stuff). https://github.com/elastic/elasticsearch/blob/24e1858a70bd255ebc210415acaac1bfb40340d3/modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java Am I correct to say that this feature won't appear automatically in Elasticsearch unless we create a DelimitedBoostTokenFilterFactory there as well?
[GitHub] [lucene-solr] iverase commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query
iverase commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query URL: https://github.com/apache/lucene-solr/pull/587#discussion_r379411676 ## File path: lucene/core/src/java/org/apache/lucene/geo/Circle2D.java ## @@ -0,0 +1,463 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.geo; + +import org.apache.lucene.index.PointValues.Relation; +import org.apache.lucene.util.SloppyMath; + +/** + * 2D circle implementation containing spatial logic. 
+ */ +class Circle2D implements Component2D { + + private final DistanceCalculator calculator; + + private Circle2D(DistanceCalculator calculator) { +this.calculator = calculator; + } + + @Override + public double getMinX() { +return calculator.getMinX(); + } + + @Override + public double getMaxX() { +return calculator.getMaxX(); + } + + @Override + public double getMinY() { +return calculator.getMinY(); + } + + @Override + public double getMaxY() { +return calculator.getMaxY(); + } + + @Override + public boolean contains(double x, double y) { +return calculator.contains(x, y); + } + + @Override + public Relation relate(double minX, double maxX, double minY, double maxY) { +if (calculator.disjoint(minX, maxX, minY, maxY)) { + return Relation.CELL_OUTSIDE_QUERY; +} +if (calculator.within(minX, maxX, minY, maxY)) { + return Relation.CELL_CROSSES_QUERY; +} +return calculator.relate(minX, maxX, minY, maxY); + } + + @Override + public Relation relateTriangle(double minX, double maxX, double minY, double maxY, + double ax, double ay, double bx, double by, double cx, double cy) { +if (calculator.disjoint(minX, maxX, minY, maxY)) { + return Relation.CELL_OUTSIDE_QUERY; +} +if (ax == bx && bx == cx && ay == by && by == cy) { + // indexed "triangle" is a point: shortcut by checking contains + return contains(ax, ay) ? 
Relation.CELL_INSIDE_QUERY : Relation.CELL_OUTSIDE_QUERY; +} else if (ax == cx && ay == cy) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(ax, ay, bx, by); +} else if (ax == bx && ay == by) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(bx, by, cx, cy); +} else if (bx == cx && by == cy) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(cx, cy, ax, ay); +} +// indexed "triangle" is a triangle: +return relateIndexedTriangle(minX, maxX, minY, maxY, ax, ay, bx, by, cx, cy); + } + + @Override + public WithinRelation withinTriangle(double minX, double maxX, double minY, double maxY, + double ax, double ay, boolean ab, double bx, double by, boolean bc, double cx, double cy, boolean ca) { +// short cut, lines and points cannot contain this type of shape +if ((ax == bx && ay == by) || (ax == cx && ay == cy) || (bx == cx && by == cy)) { + return WithinRelation.DISJOINT; Review comment: Strictly speaking yes but a point or a line would never return CANDIDATE for a circle. I guess I am using disjoint because is the cheaper answer (the shape is actually ignored) but I agree this is probably wrong. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable
[ https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036955#comment-17036955 ] ASF subversion and git services commented on LUCENE-9187: - Commit 210f2f83f7c637a8c0f96fe563ab8b49fa18dfd9 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=210f2f8 ] Add back assertions removed by LUCENE-9187. (#1236) This time they would only apply to TestFastLZ4/TestHighLZ4 and avoid slowing down all tests. > remove too-expensive assert from LZ4 HighCompressionHashTable > - > > Key: LUCENE-9187 > URL: https://issues.apache.org/jira/browse/LUCENE-9187 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9187.patch > > Time Spent: 40m > Remaining Estimate: 0h > > This is the slowest method in the lucene tests. See LUCENE-9185 for what I > mean. > If you look at it, its checking 64k values every time the assert is called. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable
[ https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036954#comment-17036954 ] ASF subversion and git services commented on LUCENE-9187: - Commit da33e4aa6f92ee622a9605636bc5fe1c105f1f94 in lucene-solr's branch refs/heads/branch_8x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=da33e4a ] LUCENE-9187: remove too-expensive assert from LZ4 HighCompressionHashTable > remove too-expensive assert from LZ4 HighCompressionHashTable > - > > Key: LUCENE-9187 > URL: https://issues.apache.org/jira/browse/LUCENE-9187 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9187.patch > > Time Spent: 40m > Remaining Estimate: 0h > > This is the slowest method in the lucene tests. See LUCENE-9185 for what I > mean. > If you look at it, its checking 64k values every time the assert is called. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] andywebb1975 commented on issue #1247: SOLR-14252 use double rather than Double to avoid NPE
andywebb1975 commented on issue #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#issuecomment-586263732 Note it's reporter config like the below that triggers the exceptions - the cache metrics include `LocalStatsCache` values:

```
10
UPDATE\./update/.*requests
QUERY\./select.*requests
CACHE\.searcher.*
```
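To make the filter values above concrete, here is an illustrative demo of how such regexes select metric names. This is plain `java.util.regex`, not Solr's reporter code — whether Solr anchors these as full matches or treats them as prefixes is a detail of the reporter implementation, so full-string matching here is an assumption:

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: the filter regexes from the reporter config above,
// applied as full-string matches against candidate metric names.
public class MetricFilterDemo {
    static final List<Pattern> FILTERS = List.of(
        Pattern.compile("UPDATE\\./update/.*requests"),
        Pattern.compile("QUERY\\./select.*requests"),
        Pattern.compile("CACHE\\.searcher.*"));

    static boolean accepted(String metricName) {
        return FILTERS.stream().anyMatch(p -> p.matcher(metricName).matches());
    }

    public static void main(String[] args) {
        System.out.println(accepted("QUERY./select.requests"));     // true
        System.out.println(accepted("CACHE.searcher.filterCache")); // true
        System.out.println(accepted("ADMIN./admin/ping.requests")); // false
    }
}
```

The `CACHE\.searcher.*` filter is the one that pulls in the cache metrics (including the `LocalStatsCache` values) that trigger the NPE the PR fixes.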
[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379399436 ## File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java ## @@ -51,6 +61,18 @@ public String toString() { ", updateCount=" + updateCount + '}'; } + +public double toDouble() { Review comment: I don't think so, no - and I think some/all of the other `public` things in these classes could be made less accessible too. Happy to take a look at this, but would like to make sure the metrics aggregation is working correctly first.
[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379398056 ## File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java ## @@ -51,6 +61,18 @@ public String toString() { ", updateCount=" + updateCount + '}'; } + +public double toDouble() { + if (value instanceof Boolean) { +return 0; + } + if (!(value instanceof Number)) { +log.debug("not a Number: " + value); Review comment: yes - I'm not sure this logging is useful; I may just remove it instead.
[GitHub] [lucene-solr] andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE
andywebb1975 commented on a change in pull request #1247: SOLR-14252 use double rather than Double to avoid NPE URL: https://github.com/apache/lucene-solr/pull/1247#discussion_r379397661 ## File path: solr/core/src/java/org/apache/solr/metrics/AggregateMetric.java ## @@ -93,18 +115,11 @@ public double getMax() { if (values.isEmpty()) { return 0; } -Double res = null; +double res = 0; Review comment: That's a good point - I don't know. (Who might know?) It may be better to switch to creating a list of numerical values and then finding its min/max/mean/etc (ideally using standard library functions) - but this may be overkill.
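The concern in the diff above is real: seeding `double res = 0` changes `getMax()` when every value is negative, while the old `Double res = null` sentinel risks an NPE through auto-unboxing. One way to keep a primitive and still handle all-negative inputs is to seed from the first element. This is a simplified sketch, not Solr's AggregateMetric:

```java
import java.util.List;

// Sketch: primitive-double max that avoids both the unboxing NPE of a
// null Double sentinel and the wrong answer a 0 seed gives for
// all-negative inputs. Empty input returns 0, matching the caller's
// existing convention in getMax().
public class MinMaxSketch {
    static double maxOf(List<? extends Number> values) {
        boolean first = true;
        double res = 0;
        for (Number n : values) {
            double v = n.doubleValue();
            if (first || v > res) {
                res = v;
                first = false;
            }
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(maxOf(List.of(-5, -2, -9))); // -2.0, not 0.0
        System.out.println(maxOf(List.of()));           // 0.0
    }
}
```

Collecting the values into a list and using `DoubleStream.max()` (as the comment suggests) would work too, at the cost of an extra pass and allocation.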
[GitHub] [lucene-solr] iverase opened a new pull request #1258: LUCENE-9225: Rectangle should extend LatLonGeometry
iverase opened a new pull request #1258: LUCENE-9225: Rectangle should extend LatLonGeometry URL: https://github.com/apache/lucene-solr/pull/1258 Rectangle now extends LatLonGeometry so it can be used as part of a geometry collection. We need to be careful with Contains, and we need to split the rectangle in two if it crosses the dateline. A test is added to check that we get the same results from LatLonBoundingBoxQuery and the corresponding geometry query.
[jira] [Created] (LUCENE-9225) Rectangle should extend LatLonGeometry
Ignacio Vera created LUCENE-9225: Summary: Rectangle should extend LatLonGeometry Key: LUCENE-9225 URL: https://issues.apache.org/jira/browse/LUCENE-9225 Project: Lucene - Core Issue Type: Improvement Reporter: Ignacio Vera Rectangle is the only geometry class that does not extend LatLonGeometry. This is because we have a specialized query for rectangles that works on the encoded space (very similar to what LatLonPoint is doing). It would be nice if Rectangle could implement LatLonGeometry, so in cases where a bounding box is part of a complex geometry, it can fall back to Component2D objects. The idea is to move the specialized logic in Rectangle2D inside the specialized LatLonBoundingBoxQuery and rename the current XYRectangle2D to Rectangle2D.
[jira] [Comment Edited] (LUCENE-9217) Add validation for XYGeometries
[ https://issues.apache.org/jira/browse/LUCENE-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036898#comment-17036898 ] Ignacio Vera edited comment on LUCENE-9217 at 2/14/20 10:44 AM: fixed in LUCENE-9218 was (Author: ivera): resolve in LUCENE-9218 > Add validation for XYGeometries > --- > > Key: LUCENE-9217 > URL: https://issues.apache.org/jira/browse/LUCENE-9217 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > Currently when creating XY geometries, there is no proper validation, in > particular checks for NaN, INF and -INF value which should not be allowed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9217) Add validation for XYGeometries
[ https://issues.apache.org/jira/browse/LUCENE-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9217. -- Assignee: Ignacio Vera Resolution: Won't Fix resolve in LUCENE-9218 > Add validation for XYGeometries > --- > > Key: LUCENE-9217 > URL: https://issues.apache.org/jira/browse/LUCENE-9217 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > Currently when creating XY geometries, there is no proper validation, in > particular checks for NaN, INF and -INF value which should not be allowed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9218) XYGeometries should use floats instead of doubles
[ https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9218. -- Fix Version/s: 8.5 Assignee: Ignacio Vera Resolution: Fixed > XYGeometries should use floats instead of doubles > - > > Key: LUCENE-9218 > URL: https://issues.apache.org/jira/browse/LUCENE-9218 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Minor > Fix For: 8.5 > > Time Spent: 0.5h > Remaining Estimate: 0h > > XYGeometries (XYPolygon, XYLine, XYRectangle & XYPoint) are a bit > counter-intuitive. While most of them are initialised using floats, the > returned values are doubles. In addition XYRectangle > seems to work on doubles. > In this issue it is proposed to harmonise those classes to only work on > floats. As these classes were just moved to core and have not been > released, it should be ok to change their interfaces.
[GitHub] [lucene-solr] iverase closed pull request #1249: LUCENE-9217: Add validation to XYGeometries
iverase closed pull request #1249: LUCENE-9217: Add validation to XYGeometries URL: https://github.com/apache/lucene-solr/pull/1249
[GitHub] [lucene-solr] iverase commented on issue #1249: LUCENE-9217: Add validation to XYGeometries
iverase commented on issue #1249: LUCENE-9217: Add validation to XYGeometries URL: https://github.com/apache/lucene-solr/pull/1249#issuecomment-586204933 fixed in #1252
[jira] [Commented] (LUCENE-9218) XYGeometries should use floats instead of doubles
[ https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036895#comment-17036895 ] ASF subversion and git services commented on LUCENE-9218: - Commit ca3319cdbce14120291b1dfcda94d25abd677276 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ca3319c ] LUCENE-9218: XYGeometries should expose values as floats (#1252) > XYGeometries should use floats instead of doubles > - > > Key: LUCENE-9218 > URL: https://issues.apache.org/jira/browse/LUCENE-9218 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > XYGeometries (XYPolygon, XYLine, XYRectangle & XYPoint) are a bit > counter-intuitive. While most of them are initialised using floats, the > returned values are doubles. In addition XYRectangle > seems to work on doubles. > In this issue it is proposed to harmonise those classes to only work on > floats. As these classes were just moved to core and have not been > released, it should be ok to change their interfaces.
[GitHub] [lucene-solr] iverase merged pull request #1252: LUCENE-9218: XYGeometries should expose values as floats
iverase merged pull request #1252: LUCENE-9218: XYGeometries should expose values as floats URL: https://github.com/apache/lucene-solr/pull/1252
[jira] [Commented] (LUCENE-9218) XYGeometries should use floats instead of doubles
[ https://issues.apache.org/jira/browse/LUCENE-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036893#comment-17036893 ] ASF subversion and git services commented on LUCENE-9218: - Commit 4a54ffb553ad1e45da147dd93f63a20bb4564c91 in lucene-solr's branch refs/heads/master from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4a54ffb ] LUCENE-9218: XYGeometries should expose values as floats (#1252)
[jira] [Commented] (LUCENE-9203) Make DocValuesIterator public
[ https://issues.apache.org/jira/browse/LUCENE-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036878#comment-17036878 ] Adrien Grand commented on LUCENE-9203: -- Yes indeed, I remember another instance of it which I can't find again that only needed this class because it was trying to stuff everything into a base class when composition instead of inheritance would have made things even simpler. I'm not completely opposed to making it public but I'd like to see a compelling use-case for it. You mentioned consuming all doc values in a generic manner but the couple cases I've seen were better served by using nextDoc/advance than advanceExact? We don't give ways to consume values in a generic manner either but I've not seen many asks for it? > Make DocValuesIterator public > - > > Key: LUCENE-9203 > URL: https://issues.apache.org/jira/browse/LUCENE-9203 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Affects Versions: 8.4 >Reporter: juan camilo rodriguez duran >Priority: Trivial > Labels: docValues > > By doing this, we improve extensibility for new formats. Additionally this > will improve coherence with the public method already existent in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
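A toy sketch of the two consumption styles contrasted above, using a plain `TreeMap` stand-in rather than Lucene's actual `DocValuesIterator`: `nextDoc`/`advance` skip directly between the documents that have a value, while `advanceExact` answers a random-access "does this doc have a value?" question. All class and method names here are illustrative only.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy stand-in for a sparse per-document values iterator; NOT Lucene's
// DocValuesIterator, just an illustration of the two access styles.
public class SparseValues {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final NavigableMap<Integer, Long> values; // docId -> value
    private int doc = -1;

    public SparseValues(NavigableMap<Integer, Long> values) {
        this.values = values;
    }

    /** Forward iteration: jump straight to the next document that has a value. */
    public int nextDoc() {
        Integer next = values.higherKey(doc);
        doc = (next == null) ? NO_MORE_DOCS : next;
        return doc;
    }

    /** Random access: position on target, report whether it has a value. */
    public boolean advanceExact(int target) {
        doc = target;
        return values.containsKey(target);
    }

    public long longValue() {
        return values.get(doc);
    }

    /** Consuming *all* values is simpler with nextDoc than with advanceExact. */
    public static long sumAll(SparseValues it) {
        long sum = 0;
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            sum += it.longValue();
        }
        return sum;
    }
}
```

The `sumAll` loop shows why the couple of cases mentioned above were better served by `nextDoc`/`advance`: the iterator visits only docs that actually carry a value.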
[jira] [Commented] (LUCENE-9187) remove too-expensive assert from LZ4 HighCompressionHashTable
[ https://issues.apache.org/jira/browse/LUCENE-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036845#comment-17036845 ] ASF subversion and git services commented on LUCENE-9187: - Commit 5cbe58f22c71cfe3f6d21b4661914c255ac80e3d in lucene-solr's branch refs/heads/master from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5cbe58f ] Add back assertions removed by LUCENE-9187. (#1236) This time they would only apply to TestFastLZ4/TestHighLZ4 and avoid slowing down all tests. > remove too-expensive assert from LZ4 HighCompressionHashTable > - > > Key: LUCENE-9187 > URL: https://issues.apache.org/jira/browse/LUCENE-9187 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9187.patch > > Time Spent: 40m > Remaining Estimate: 0h > > This is the slowest method in the lucene tests. See LUCENE-9185 for what I > mean. > If you look at it, it's checking 64k values every time the assert is called.
[GitHub] [lucene-solr] jpountz merged pull request #1236: Add back assertions removed by LUCENE-9187.
jpountz merged pull request #1236: Add back assertions removed by LUCENE-9187. URL: https://github.com/apache/lucene-solr/pull/1236
[jira] [Resolved] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts
[ https://issues.apache.org/jira/browse/LUCENE-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-8762. -- Resolution: Won't Fix This has been implemented in the new Lucene84PostingsFormat. > Lucene50PostingsReader should specialize reading docs+freqs with impacts > > > Key: LUCENE-8762 > URL: https://issues.apache.org/jira/browse/LUCENE-8762 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently if you ask for impacts, we only have one implementation that is > able to expose everything: docs, freqs, positions and offsets. In contrast, > if you don't need impacts, we have specialization for docs+freqs, > docs+freqs+positions and docs+freqs+positions+offsets. > Maybe we should add specialization for the docs+freqs case with impacts, > which should be the most common case, and remove specialization for > docs+freqs+positions when impacts are not requested? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320715 ## File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java ## @@ -210,12 +216,27 @@ public IndexSearcher(IndexReader r, Executor executor) { public IndexSearcher(IndexReaderContext context, Executor executor) { assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader(); reader = context.reader(); -this.executor = executor; +this.sliceExecutionControlPlane = getSliceExecutionControlPlane(executor); this.readerContext = context; leafContexts = context.leaves(); this.leafSlices = executor == null ? null : slices(leafContexts); } + /** + * We do this elaborate dance as to have only one constructor with a nullable second parameter + * See the next constructor for more clarification + * Only for testing + */ + IndexSearcher(IndexReaderContext context, Executor executor, Review comment: this doesn't need an executor, does it? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320286 ## File path: lucene/core/src/java/org/apache/lucene/search/DefaultExecutionControlPlane.java ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.Executor; +import java.util.concurrent.Future; +import java.util.concurrent.FutureTask; +import java.util.concurrent.RejectedExecutionException; + +/** + * Default implementation of SliceExecutionControlPlane which executes FutureTask instances. This is used + * by IndexSearcher as default unless overridden by a custom implementation. 
+ */ +public class DefaultExecutionControlPlane implements SliceExecutionControlPlane, FutureTask> { + private final Executor executor; + + public DefaultExecutionControlPlane(Executor executor) { +assert executor != null; +this.executor = executor; + } + + @Override + public List invokeAll(Collection tasks) { +List futures = new ArrayList(); +int i = 0; + +for (FutureTask task : tasks) { + boolean shouldExecuteOnCallerThread = false; + + // Execute last task on caller thread + if (i == tasks.size() - 1) { +shouldExecuteOnCallerThread = true; + } + + processTask(task, futures, shouldExecuteOnCallerThread); + + i++; +} + +return futures; + } + + // Helper method to execute a single task + protected void processTask(FutureTask task, List futures, + boolean shouldExecuteOnCallerThread) { +if (task == null) { + throw new IllegalArgumentException("Input is null"); +} + +if (!shouldExecuteOnCallerThread) { + try { +executor.execute(task); + } catch (RejectedExecutionException e) { +// Execute on caller thread +shouldExecuteOnCallerThread = true; + } +} + +if (shouldExecuteOnCallerThread) { + try { +task.run(); + } catch (Exception e) { +throw new RuntimeException(e.getMessage()); + } +} + +if (!shouldExecuteOnCallerThread) { + futures.add(task); +} else { + try { +futures.add(CompletableFuture.completedFuture(task.get())); + } catch (Exception e) { +throw new RuntimeException(e.getMessage()); Review comment: please preserve the stack trace of the cause This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379322585 ## File path: lucene/core/src/java/org/apache/lucene/search/SliceExecutionControlPlane.java ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +import java.util.Collection; + + +/** + * Execution control plane which is responsible + * for execution of slices based on the current status + * of the system and current system load + */ +public interface SliceExecutionControlPlane { + /** + * Invoke all slices that are allocated for the query + */ + C invokeAll(Collection tasks); +} Review comment: the generics on this interface look over-engineered? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379318977 ## File path: lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java ## @@ -113,38 +139,10 @@ import org.junit.ClassRule; import org.junit.Rule; import org.junit.Test; +import org.junit.internal.AssumptionViolatedException; import org.junit.rules.RuleChain; import org.junit.rules.TestRule; import org.junit.runner.RunWith; -import org.junit.internal.AssumptionViolatedException; - -import com.carrotsearch.randomizedtesting.JUnit4MethodProvider; -import com.carrotsearch.randomizedtesting.LifecycleScope; -import com.carrotsearch.randomizedtesting.MixWithSuiteName; -import com.carrotsearch.randomizedtesting.RandomizedContext; -import com.carrotsearch.randomizedtesting.RandomizedRunner; -import com.carrotsearch.randomizedtesting.RandomizedTest; -import com.carrotsearch.randomizedtesting.annotations.Listeners; -import com.carrotsearch.randomizedtesting.annotations.SeedDecorators; -import com.carrotsearch.randomizedtesting.annotations.TestGroup; -import com.carrotsearch.randomizedtesting.annotations.TestMethodProviders; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakAction.Action; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakAction; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakFilters; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakGroup.Group; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakGroup; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakLingering; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakScope.Scope; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakScope; -import com.carrotsearch.randomizedtesting.annotations.ThreadLeakZombies.Consequence; -import 
com.carrotsearch.randomizedtesting.annotations.ThreadLeakZombies; -import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; -import com.carrotsearch.randomizedtesting.generators.RandomPicks; -import com.carrotsearch.randomizedtesting.rules.NoClassHooksShadowingRule; -import com.carrotsearch.randomizedtesting.rules.NoInstanceHooksOverridesRule; -import com.carrotsearch.randomizedtesting.rules.StaticFieldsInvariantRule; - -import junit.framework.AssertionFailedError; Review comment: let's undo changes to this class? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches
jpountz commented on a change in pull request #1214: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches URL: https://github.com/apache/lucene-solr/pull/1214#discussion_r379320355 ## File path: lucene/core/src/java/org/apache/lucene/search/DefaultExecutionControlPlane.java ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.search; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.Executor; +import java.util.concurrent.Future; +import java.util.concurrent.FutureTask; +import java.util.concurrent.RejectedExecutionException; + +/** + * Default implementation of SliceExecutionControlPlane which executes FutureTask instances. This is used + * by IndexSearcher as default unless overridden by a custom implementation. 
+ */ +public class DefaultExecutionControlPlane implements SliceExecutionControlPlane, FutureTask> { + private final Executor executor; + + public DefaultExecutionControlPlane(Executor executor) { +assert executor != null; +this.executor = executor; + } + + @Override + public List invokeAll(Collection tasks) { +List futures = new ArrayList(); +int i = 0; + +for (FutureTask task : tasks) { + boolean shouldExecuteOnCallerThread = false; + + // Execute last task on caller thread + if (i == tasks.size() - 1) { +shouldExecuteOnCallerThread = true; + } + + processTask(task, futures, shouldExecuteOnCallerThread); + + i++; +} + +return futures; + } + + // Helper method to execute a single task + protected void processTask(FutureTask task, List futures, + boolean shouldExecuteOnCallerThread) { +if (task == null) { + throw new IllegalArgumentException("Input is null"); +} + +if (!shouldExecuteOnCallerThread) { + try { +executor.execute(task); + } catch (RejectedExecutionException e) { +// Execute on caller thread +shouldExecuteOnCallerThread = true; + } +} + +if (shouldExecuteOnCallerThread) { + try { +task.run(); + } catch (Exception e) { +throw new RuntimeException(e.getMessage()); Review comment: please preserve the stack trace of the cause This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query
jpountz commented on a change in pull request #587: LUCENE-8707: Add LatLonShape and XYShape distance query URL: https://github.com/apache/lucene-solr/pull/587#discussion_r379317415 ## File path: lucene/core/src/java/org/apache/lucene/geo/Circle2D.java ## @@ -0,0 +1,463 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.geo; + +import org.apache.lucene.index.PointValues.Relation; +import org.apache.lucene.util.SloppyMath; + +/** + * 2D circle implementation containing spatial logic. 
+ */ +class Circle2D implements Component2D { + + private final DistanceCalculator calculator; + + private Circle2D(DistanceCalculator calculator) { +this.calculator = calculator; + } + + @Override + public double getMinX() { +return calculator.getMinX(); + } + + @Override + public double getMaxX() { +return calculator.getMaxX(); + } + + @Override + public double getMinY() { +return calculator.getMinY(); + } + + @Override + public double getMaxY() { +return calculator.getMaxY(); + } + + @Override + public boolean contains(double x, double y) { +return calculator.contains(x, y); + } + + @Override + public Relation relate(double minX, double maxX, double minY, double maxY) { +if (calculator.disjoint(minX, maxX, minY, maxY)) { + return Relation.CELL_OUTSIDE_QUERY; +} +if (calculator.within(minX, maxX, minY, maxY)) { + return Relation.CELL_CROSSES_QUERY; +} +return calculator.relate(minX, maxX, minY, maxY); + } + + @Override + public Relation relateTriangle(double minX, double maxX, double minY, double maxY, + double ax, double ay, double bx, double by, double cx, double cy) { +if (calculator.disjoint(minX, maxX, minY, maxY)) { + return Relation.CELL_OUTSIDE_QUERY; +} +if (ax == bx && bx == cx && ay == by && by == cy) { + // indexed "triangle" is a point: shortcut by checking contains + return contains(ax, ay) ? 
Relation.CELL_INSIDE_QUERY : Relation.CELL_OUTSIDE_QUERY; +} else if (ax == cx && ay == cy) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(ax, ay, bx, by); +} else if (ax == bx && ay == by) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(bx, by, cx, cy); +} else if (bx == cx && by == cy) { + // indexed "triangle" is a line segment: shortcut by calling appropriate method + return relateIndexedLineSegment(cx, cy, ax, ay); +} +// indexed "triangle" is a triangle: +return relateIndexedTriangle(minX, maxX, minY, maxY, ax, ay, bx, by, cx, cy); + } + + @Override + public WithinRelation withinTriangle(double minX, double maxX, double minY, double maxY, + double ax, double ay, boolean ab, double bx, double by, boolean bc, double cx, double cy, boolean ca) { +// short cut, lines and points cannot contain this type of shape +if ((ax == bx && ay == by) || (ax == cx && ay == cy) || (bx == cx && by == cy)) { + return WithinRelation.DISJOINT; Review comment: couldn't it be NOTWITHIN in some cases? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
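The shortcut being discussed detects when two of the three vertices coincide, i.e. the indexed "triangle" is really a point or a line segment. Extracted as a standalone sketch (the helper name is hypothetical, not Lucene's code):

```java
public class TriangleKind {
    /**
     * True when at least two of the three vertices coincide, meaning the
     * "triangle" degenerates to a point or a line segment and cannot
     * contain a 2D shape.
     */
    static boolean isDegenerate(double ax, double ay,
                                double bx, double by,
                                double cx, double cy) {
        return (ax == bx && ay == by)
            || (ax == cx && ay == cy)
            || (bx == cx && by == cy);
    }
}
```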
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats
jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats URL: https://github.com/apache/lucene-solr/pull/1252#discussion_r379314539 ## File path: lucene/core/src/java/org/apache/lucene/geo/XYEncodingUtils.java ## @@ -68,7 +68,15 @@ public static double decode(int encoded) { * @param offset offset into {@code src} to decode from. * @return decoded value. */ - public static double decode(byte[] src, int offset) { + public static float decode(byte[] src, int offset) { return decode(NumericUtils.sortableBytesToInt(src, offset)); } + + static double[] floatToDouble(float[] f) { Review comment: nit: maybe add `Array` to the method name This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
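For reference, the helper under discussion widens a `float[]` to a `double[]`; each `float` converts to `double` exactly, with no precision loss. A sketch using the reviewer's suggested `Array` naming (illustrative only):

```java
public class FloatArrays {
    /** Widen a float[] to a double[]; every element converts exactly. */
    static double[] floatArrayToDoubleArray(float[] f) {
        double[] d = new double[f.length];
        for (int i = 0; i < f.length; i++) {
            d[i] = f[i]; // implicit widening conversion, always exact
        }
        return d;
    }
}
```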
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats
jpountz commented on a change in pull request #1252: LUCENE-9218: XYGeometries should expose values as floats URL: https://github.com/apache/lucene-solr/pull/1252#discussion_r379314365 ## File path: lucene/core/src/java/org/apache/lucene/geo/XYEncodingUtils.java ## @@ -33,11 +33,12 @@ private XYEncodingUtils() { } - /** validates value is within +/-{@link Float#MAX_VALUE} coordinate bounds */ - public static void checkVal(double x) { -if (Double.isNaN(x) || x < MIN_VAL_INCL || x > MAX_VAL_INCL) { + /** validates value is a number and finite */ + static float checkVal(float x) { +if (Float.isNaN(x) || Float.isInfinite(x)) { Review comment: This could be a single test. ```suggestion if (Float.isFinite(x) == false) { ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
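The suggested single test works because `Float.isFinite` is false for exactly three inputs: NaN and the two infinities. A quick check:

```java
public class FiniteCheck {
    /** One test replaces Float.isNaN(x) || Float.isInfinite(x). */
    static boolean isInvalid(float x) {
        return Float.isFinite(x) == false;
    }
}
```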
[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage
[ https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036806#comment-17036806 ] Adrien Grand commented on LUCENE-9211: -- I had a quick look at Juan's commit, there are things I like and things I have questions about. Since this PR is ready, or almost ready, I'd suggest merging this one first. [~juan.duran] I saw that your commit tried to modify the current Lucene80DocValuesFormat. I'm a bit nervous about it because it makes it hard to spot any potential subtle difference in the on-disk format that would cause bugs, so I'd suggest creating a new Lucene85DocValuesFormat instead, even if it has the same ideas or even same on-disk format as the current Lucene80DocValuesFormat? > Adding compression to BinaryDocValues storage > - > > Key: LUCENE-9211 > URL: https://issues.apache.org/jira/browse/LUCENE-9211 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Labels: pull-request-available > > While SortedSetDocValues can be used today to store identical values in a > compact form this is not effective for data with many unique values. > The proposal is that BinaryDocValues should be stored in LZ4 compressed > blocks which can dramatically reduce disk storage costs in many cases. The > proposal is blocks of a number of documents are stored as a single compressed > blob along with metadata that records offsets where the original document > values can be found in the uncompressed content. > There's a trade-off here between efficient compression (more docs-per-block = > better compression) and fast retrieval times (fewer docs-per-block = faster > read access for single values). A fixed block size of 32 docs seems like it > would be a reasonable compromise for most scenarios. 
> A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]
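The scheme described in the issue can be sketched end-to-end with `java.util.zip` standing in for LZ4 (the actual PR uses Lucene's LZ4 and on-disk index files; everything below, including the class name, is illustrative): values are packed into fixed blocks of 32 docs, each block is compressed as one blob, and per-document offsets let a reader slice a single value back out after decompressing only that block.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressedValues {
    static final int DOCS_PER_BLOCK = 32; // the compromise block size proposed above

    private final List<byte[]> compressedBlocks = new ArrayList<>();
    private final List<int[]> docStarts = new ArrayList<>();          // value offsets per block
    private final List<Integer> uncompressedLengths = new ArrayList<>();

    /** "Index": pack values into blocks of DOCS_PER_BLOCK, compress each block once. */
    public BlockCompressedValues(List<byte[]> values) {
        for (int start = 0; start < values.size(); start += DOCS_PER_BLOCK) {
            int end = Math.min(start + DOCS_PER_BLOCK, values.size());
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            int[] starts = new int[end - start + 1];
            for (int i = start; i < end; i++) {
                starts[i - start] = block.size();
                block.write(values.get(i), 0, values.get(i).length);
            }
            starts[end - start] = block.size(); // sentinel: end of last value
            docStarts.add(starts);
            uncompressedLengths.add(block.size());
            compressedBlocks.add(compress(block.toByteArray()));
        }
    }

    /** Read one value: decompress only the block containing docId, then slice. */
    public byte[] get(int docId) {
        int blockId = docId / DOCS_PER_BLOCK;
        int within = docId % DOCS_PER_BLOCK;
        byte[] block = decompress(compressedBlocks.get(blockId), uncompressedLengths.get(blockId));
        int[] starts = docStarts.get(blockId);
        return Arrays.copyOfRange(block, starts[within], starts[within + 1]);
    }

    private static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] decompress(byte[] data, int uncompressedLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] out = new byte[uncompressedLength];
        int off = 0;
        try {
            while (off < uncompressedLength) {
                off += inflater.inflate(out, off, uncompressedLength - off);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e); // wrap with cause preserved
        }
        inflater.end();
        return out;
    }
}
```

This makes the stated trade-off concrete: a larger `DOCS_PER_BLOCK` compresses better but forces more decompression work per single-value read.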
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379305128 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -742,6 +757,125 @@ public BytesRef binaryValue() throws IOException { }; } } + } + + // Decompresses blocks of binary values to retrieve content + class BinaryDecoder { + +private final LongValues addresses; +private final IndexInput compressedData; +// Cache of last uncompressed block +private long lastBlockId = -1; +private final int []uncompressedDocStarts; +private int uncompressedBlockLength = 0; +private final byte[] uncompressedBlock; +private final BytesRef uncompressedBytesRef; +private final int docsPerChunk; + +public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunk) { + super(); + this.addresses = addresses; + this.compressedData = compressedData; + // pre-allocate a byte array large enough for the biggest uncompressed block needed. + this.uncompressedBlock = new byte[biggestUncompressedBlockSize]; + uncompressedBytesRef = new BytesRef(uncompressedBlock); + this.docsPerChunk = docsPerChunk; + uncompressedDocStarts = new int[docsPerChunk + 1]; + +} + +BytesRef decode(int docNumber) throws IOException { + int blockId = docNumber >> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT; Review comment: let's use the shift from the BinaryEntry instead of the constant? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
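The arithmetic under review: with blocks of `1 << shift` documents (shift 5 for the 32-doc blocks proposed in LUCENE-9211), a doc id splits into a block id and an offset within the block. The reviewer's point is that the shift should come from the `BinaryEntry` metadata rather than a hard-coded constant; a hedged sketch of just the math:

```java
public class BlockMath {
    /** Which compressed block holds docNumber, for blocks of 1 << blockShift docs. */
    static int blockId(int docNumber, int blockShift) {
        return docNumber >> blockShift;
    }

    /** Position of docNumber inside its block (mask is (1 << blockShift) - 1). */
    static int docInBlock(int docNumber, int blockShift) {
        return docNumber & ((1 << blockShift) - 1);
    }
}
```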
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379299380 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close 
because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } -meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); +data.writeVInt(numDocsInCurrentBlock); +for (int i = 0; i < 
numDocsInCurrentBlock; i++) { + data.writeVInt(docLengths[i]); +} +maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength); +LZ4.compress(block, 0, uncompressedBlockLength, data, ht); +numDocsInCurrentBlock = 0; +uncompressedBlockLength = 0; +maxPointer = data.getFilePointer(); +tempBinaryOffsets.writeVLong(maxPointer - thisBlockStartPointer); + } +} + +void writeMetaData() throws IOException { + if
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379306909 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + boolean success = false; + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); +success = true; + } finally { +if (success == false) { + 
IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +} + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } -meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); + +// Optimisation - check 
if all lengths are same +boolean allLengthsSame = true && numDocsInCurrentBlock >0 ; +for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; i++) { Review comment: in general we do a `break` when setting `allLengthsSame = false` instead of adding it to the exit condition of the for statement This is an automated message from the Apache Git Service. To respond to the
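The review comment above prefers exiting with an explicit `break` over folding the flag into the for-condition. A minimal standalone sketch of that style (class and method names are illustrative, not from the PR; it also starts the scan at `i = 1`, as suggested elsewhere in this review):

```java
public class LengthScan {
    // Reviewer-preferred style: bail out with `break` as soon as a mismatch is
    // found, instead of testing the flag in the loop's exit condition.
    static boolean allLengthsSame(int[] docLengths, int numDocs) {
        boolean allLengthsSame = true;
        for (int i = 1; i < numDocs; i++) { // compare each entry to its predecessor
            if (docLengths[i] != docLengths[i - 1]) {
                allLengthsSame = false;
                break;
            }
        }
        return allLengthsSame;
    }

    public static void main(String[] args) {
        System.out.println(allLengthsSame(new int[]{4, 4, 4}, 3)); // prints "true"
    }
}
```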
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379298761 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + boolean success = false; + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); +success = true; + } finally { +if (success == false) { + 
IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +} + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } -meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); + +// Optimisation - check 
if all lengths are same +boolean allLengthsSame = true && numDocsInCurrentBlock >0 ; Review comment: The second condition is necessarily true given the parent if statement. ```suggestion boolean allLengthsSame = true; ```
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379304074 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -59,18 +60,18 @@ private long ramBytesUsed; private final IndexInput data; private final int maxDoc; + private int version = -1; /** expert: instantiates a new reader */ Lucene80DocValuesProducer(SegmentReadState state, String dataCodec, String dataExtension, String metaCodec, String metaExtension) throws IOException { String metaName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, metaExtension); this.maxDoc = state.segmentInfo.maxDoc(); ramBytesUsed = RamUsageEstimator.shallowSizeOfInstance(getClass()); -int version = -1; Review comment: maybe keep this variable actually, it would help make version final by doing `this.version = version;` after the try block? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
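The suggestion here is to keep parsing into a local `version` and assign the field exactly once after the try block, so the field can be declared `final`. A hedged, self-contained sketch of the pattern (the parsing below is a stand-in for what the real constructor does with `CodecUtil`; names are illustrative):

```java
public class VersionedReader {
    private final int version; // can be final: assigned exactly once, after the try

    public VersionedReader(String header) {
        int version = -1; // local variable, as in the original code
        try {
            // Stand-in for reading/validating the index header; the real
            // Lucene80DocValuesProducer constructor checks codec headers here.
            version = Integer.parseInt(header);
        } catch (NumberFormatException e) {
            // resource cleanup would happen here in the real constructor
            throw e;
        }
        this.version = version; // single assignment satisfies definite assignment
    }

    public int version() { return version; }

    public static void main(String[] args) {
        System.out.println(new VersionedReader("84").version()); // prints "84"
    }
}
```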
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379294799 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; Review comment: can we make `ht`, `tempBinaryOffsets`, `docLengths` final? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379297803 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,168 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte [] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); + } catch (Throwable exception) { +IOUtils.closeWhileHandlingException(this); //self-close 
because constructor caller can't +throw exception; + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } Review comment: Have you found what this `something else` is? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379298212 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + boolean success = false; + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); +success = true; + } finally { +if (success == false) { + 
IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +} + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } -meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); + +// Optimisation - check 
if all lengths are same +boolean allLengthsSame = true && numDocsInCurrentBlock >0 ; +for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; i++) { + if (i > 0 && docLengths[i] != docLengths[i-1]) { +allLengthsSame = false; + } +} +if (allLengthsSame) { +// Only write one value shifted. Steal a bit to indicate all other lengths are the same +int onlyOneLength = (docLengths[0]
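The quoted code is truncated at this point, but its comment describes the general technique: write one shifted length and steal the low bit as an "all other lengths are the same" flag. A hypothetical encoding along those lines (the PR's actual bit layout may differ):

```java
public class StolenBit {
    // Low bit set => every doc in the block has the same length, so only one
    // (shifted) length value needs to be stored for the whole block.
    static int encode(int length, boolean allLengthsSame) {
        return (length << 1) | (allLengthsSame ? 1 : 0);
    }

    static boolean allLengthsSame(int encoded) { return (encoded & 1) != 0; }

    static int length(int encoded) { return encoded >>> 1; }

    public static void main(String[] args) {
        int e = encode(12, true);
        System.out.println(length(e) + " " + allLengthsSame(e)); // prints "12 true"
    }
}
```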
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379304369 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) throws IOException { entry.numDocsWithField = meta.readInt(); entry.minLength = meta.readInt(); entry.maxLength = meta.readInt(); -if (entry.minLength < entry.maxLength) { +if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && entry.numDocsWithField >0)|| entry.minLength < entry.maxLength) { Review comment: ```suggestion if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && entry.numDocsWithField > 0) || entry.minLength < entry.maxLength) { ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379295432 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; Review comment: Depending on the data that will be indexed it's very hard to know what is the right initial size here. Maybe start with an empty array? This will also give increase confidence that the resizing logic works. ```suggestion byte[] block = BytesRef.EMPTY_BYTES; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
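Starting from an empty array, as the suggested change does, forces even the very first append through the resize path, which is what "increase confidence that the resizing logic works" refers to. A plain-Java sketch of that growth pattern, using `Arrays.copyOf` in place of Lucene's `ArrayUtil.grow` (which likewise over-allocates):

```java
import java.util.Arrays;

public class BlockBuffer {
    byte[] block = new byte[0]; // empty start, mirroring the BytesRef.EMPTY_BYTES suggestion
    int used = 0;

    void append(byte[] v) {
        if (used + v.length > block.length) {
            // Over-allocate (here: at least double) so repeated appends stay amortized O(1).
            block = Arrays.copyOf(block, Math.max(used + v.length, 2 * block.length));
        }
        System.arraycopy(v, 0, block, used, v.length);
        used += v.length;
    }

    public static void main(String[] args) {
        BlockBuffer b = new BlockBuffer();
        b.append(new byte[]{1, 2, 3}); // first append already exercises the resize path
        b.append(new byte[]{4, 5});
        System.out.println(b.used + " " + b.block.length); // prints "5 6"
    }
}
```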
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379307116 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java ## @@ -353,67 +360,193 @@ private void writeBlock(long[] values, int length, long gcd, ByteBuffersDataOutp } } - @Override - public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException { -meta.writeInt(field.number); -meta.writeByte(Lucene80DocValuesFormat.BINARY); - -BinaryDocValues values = valuesProducer.getBinary(field); -long start = data.getFilePointer(); -meta.writeLong(start); // dataOffset -int numDocsWithField = 0; -int minLength = Integer.MAX_VALUE; -int maxLength = 0; -for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) { - numDocsWithField++; - BytesRef v = values.binaryValue(); - int length = v.length; - data.writeBytes(v.bytes, v.offset, v.length); - minLength = Math.min(length, minLength); - maxLength = Math.max(length, maxLength); + class CompressedBinaryBlockWriter implements Closeable { +FastCompressionHashTable ht = new LZ4.FastCompressionHashTable(); +int uncompressedBlockLength = 0; +int maxUncompressedBlockLength = 0; +int numDocsInCurrentBlock = 0; +int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK]; +byte[] block = new byte [1024 * 16]; +int totalChunks = 0; +long maxPointer = 0; +long blockAddressesStart = -1; + +private IndexOutput tempBinaryOffsets; + + +public CompressedBinaryBlockWriter() throws IOException { + tempBinaryOffsets = state.directory.createTempOutput(state.segmentInfo.name, "binary_pointers", state.context); + boolean success = false; + try { +CodecUtil.writeHeader(tempBinaryOffsets, Lucene80DocValuesFormat.META_CODEC + "FilePointers", Lucene80DocValuesFormat.VERSION_CURRENT); +success = true; + } finally { +if (success == false) { + 
IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't +} + } } -assert numDocsWithField <= maxDoc; -meta.writeLong(data.getFilePointer() - start); // dataLength -if (numDocsWithField == 0) { - meta.writeLong(-2); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else if (numDocsWithField == maxDoc) { - meta.writeLong(-1); // docsWithFieldOffset - meta.writeLong(0L); // docsWithFieldLength - meta.writeShort((short) -1); // jumpTableEntryCount - meta.writeByte((byte) -1); // denseRankPower -} else { - long offset = data.getFilePointer(); - meta.writeLong(offset); // docsWithFieldOffset - values = valuesProducer.getBinary(field); - final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER); - meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength - meta.writeShort(jumpTableEntryCount); - meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER); +void addDoc(int doc, BytesRef v) throws IOException { + if (blockAddressesStart < 0) { +blockAddressesStart = data.getFilePointer(); + } + docLengths[numDocsInCurrentBlock] = v.length; + block = ArrayUtil.grow(block, uncompressedBlockLength + v.length); + System.arraycopy(v.bytes, v.offset, block, uncompressedBlockLength, v.length); + uncompressedBlockLength += v.length; + numDocsInCurrentBlock++; + if (numDocsInCurrentBlock == Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK) { +flushData(); + } } -meta.writeInt(numDocsWithField); -meta.writeInt(minLength); -meta.writeInt(maxLength); -if (maxLength > minLength) { - start = data.getFilePointer(); - meta.writeLong(start); +private void flushData() throws IOException { + if (numDocsInCurrentBlock > 0) { +// Write offset to this block to temporary offsets file +totalChunks++; +long thisBlockStartPointer = data.getFilePointer(); + +// Optimisation - check 
if all lengths are same +boolean allLengthsSame = true && numDocsInCurrentBlock >0 ; +for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK && allLengthsSame; i++) { + if (i > 0 && docLengths[i] != docLengths[i-1]) { Review comment: if you're only doing it for `i>0`, let's make the loop start at `i=1`? This is an automated message from the Apache Git Service. To respond to the
[GitHub] [lucene-solr] jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields
jpountz commented on a change in pull request #1234: Add compression for Binary doc value fields URL: https://github.com/apache/lucene-solr/pull/1234#discussion_r379306326 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java ## @@ -182,10 +183,21 @@ private BinaryEntry readBinary(ChecksumIndexInput meta) throws IOException { entry.numDocsWithField = meta.readInt(); entry.minLength = meta.readInt(); entry.maxLength = meta.readInt(); -if (entry.minLength < entry.maxLength) { +if ((version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED && entry.numDocsWithField >0)|| entry.minLength < entry.maxLength) { entry.addressesOffset = meta.readLong(); + + // Old count of uncompressed addresses + long numAddresses = entry.numDocsWithField + 1L; + // New count of compressed addresses - the number of compresseed blocks + if (version >= Lucene80DocValuesFormat.VERSION_BIN_COMPRESSED) { +entry.numCompressedChunks = meta.readVInt(); +entry.docsPerChunk = meta.readVInt(); Review comment: maybe this should be the "shift" instead of the number of docs per chunk, so that you directly have both the shift (as-is) and the mask `((1 << shift) - 1)`
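With a power-of-two number of docs per chunk, storing the shift (rather than the count) gives the reader both pieces directly: the chunk index via `docId >>> shift` and the position inside the chunk via `docId & ((1 << shift) - 1)`. A small sketch of that addressing scheme (the shift value of 5 is illustrative, not taken from the PR):

```java
public class ChunkAddress {
    // Hypothetical: 1 << 5 = 32 docs per compressed chunk.
    static final int DOCS_PER_CHUNK_SHIFT = 5;
    static final int DOCS_PER_CHUNK_MASK = (1 << DOCS_PER_CHUNK_SHIFT) - 1;

    static int chunkIndex(int docId)   { return docId >>> DOCS_PER_CHUNK_SHIFT; } // which block
    static int indexInChunk(int docId) { return docId & DOCS_PER_CHUNK_MASK; }    // offset in block

    public static void main(String[] args) {
        System.out.println(chunkIndex(70) + " " + indexInChunk(70)); // prints "2 6"
    }
}
```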
[jira] [Commented] (LUCENE-9224) (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
[ https://issues.apache.org/jira/browse/LUCENE-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036777#comment-17036777 ]

Dawid Weiss commented on LUCENE-9224:
-

Thanks for digging, Erick. Chris - maintaining both ant and gradle on master is a bit of juggling. I'd really like to start removing parts of the ant build as soon as possible. The precommit is still not fully equivalent; hope we'll get there soon.

> (ant) RAT report complains about ... solr/webapp rat-report.xml (from gradle)
> -
>
> Key: LUCENE-9224
> URL: https://issues.apache.org/jira/browse/LUCENE-9224
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-9224.patch
>
> I'm not sure if this is due to some conflagration of mixing gradle work with
> ant work, but today i encountered the following failure after running "ant
> clean precommit" ...
> {noformat}
> rat-sources:
> [rat]
> [rat] *
> [rat] Summary
> [rat] ---
> [rat] Generated at: 2020-02-13T14:46:10-07:00
> [rat] Notes: 0
> [rat] Binaries: 1
> [rat] Archives: 0
> [rat] Standards: 95
> [rat]
> [rat] Apache Licensed: 75
> [rat] Generated Documents: 0
> [rat]
> [rat] JavaDocs are generated and so license header is optional
> [rat] Generated files do not required license headers
> [rat]
> [rat] 1 Unknown Licenses
> [rat]
> [rat] ***
> [rat]
> [rat] Unapproved licenses:
> [rat]
> [rat] /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
> [rat]
> [rat] ***
> [rat]
> [rat] Archives:
> [rat]
> [rat] *
> [rat] Files with Apache License headers will be marked AL
> [rat] Binary files (which do not require AL headers) will be marked B
> [rat] Compressed archives will be marked A
> [rat] Notices, licenses etc will be marked N
> [rat] AL    /home/hossman/lucene/dev/solr/webapp/build.xml
> [rat] !?    /home/hossman/lucene/dev/solr/webapp/build/rat/rat-report.xml
> [rat] AL    /home/hossman/lucene/dev/solr/webapp/web/WEB-INF/web.xml
> ...
> {noformat}
> RAT seems to be complaining that there is no license header in its own report
> file?

--
This message was sent by Atlassian Jira (v8.3.4#803005)