[GitHub] [lucene] LuXugang commented on a diff in pull request #12078: Enhance XXXField#newRangeQuery
LuXugang commented on code in PR #12078: URL: https://github.com/apache/lucene/pull/12078#discussion_r1068923445 ## lucene/core/src/java/org/apache/lucene/document/LongField.java: ## @@ -108,8 +109,9 @@ public static Query newExactQuery(String field, long value) { */ public static Query newRangeQuery(String field, long lowerValue, long upperValue) { PointRangeQuery.checkArgs(field, lowerValue, upperValue); +Query fallbackQuery = LongPoint.newRangeQuery(field, lowerValue, upperValue); return new IndexOrDocValuesQuery( -LongPoint.newRangeQuery(field, lowerValue, upperValue), +new IndexSortSortedNumericDocValuesRangeQuery(field, lowerValue, upperValue, fallbackQuery), SortedNumericDocValuesField.newSlowRangeQuery(field, lowerValue, upperValue)); Review Comment: > IndexSortSortedNumericDocValuesRangeQuery is good at both so it should take precedence over both the index and the doc-values query? Thanks for helping, @jpountz , addressed in https://github.com/apache/lucene/pull/12078/commits/7adfdd3eabaffbe8fc1cf53711a7d570d57cd175 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
gsmiller commented on PR #12073: URL: https://github.com/apache/lucene/pull/12073#issuecomment-1381092287 Moved this into "draft" state until I'm able to come back and figure out why there was a benchmark improvement with this change, given the feedback that `ReqExclBulkScorer` would be expected to kick in. I've gotten busy with some other work, but plan to come back and dig into this. In the meantime though, moving to "draft." -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge
zhaih commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1068747982 ## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java: ## @@ -94,36 +93,83 @@ public int size() { } /** - * Add node on the given level + * Add node on the given level. Nodes can be inserted out of order, but it requires that the nodes Review Comment: > We need this index to query the OnHeapHnswGraph.graph to get the NeighborArray for a particular element ([ref](https://github.com/jmazanec15/lucene/blob/hnsw-merge-from-graph/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L87)). I think we generally are not exposing the `graph` outside so the usage is only inside the OnHeapHnswGraph I believe? And inside the class I think that's the only place we use the nodeIndex, which is used for retrieving certain node after binary search. I think this won't be a problem for TreeSet or TreeMap (if we use node id as key and neighbor array as value)? Or have I overlooked some other usage of retrieving via nodeIndex? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge
jmazanec15 commented on code in PR #12050: URL: https://github.com/apache/lucene/pull/12050#discussion_r1068562367 ## lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java: ## @@ -94,36 +93,83 @@ public int size() { } /** - * Add node on the given level + * Add node on the given level. Nodes can be inserted out of order, but it requires that the nodes Review Comment: Thinking about this more, one issue with a TreeSet based approach would be that we wouldnt be able to look up the index of a particular element (unless we converted the set to a list each time we need to do lookup). We need this index to query the OnHeapHnswGraph.graph to get the NeighborArray for a particular element ([ref](https://github.com/jmazanec15/lucene/blob/hnsw-merge-from-graph/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L87)). I am not sure if there is a good way around this - are you aware of any? Alternatively, I was thinking if we wanted to avoid the expensive copy for inserting when position < idx, we could switch from using an int[] to an ArrayList to represent nodesByLevel for a particular level. This should avoid in most cases the extreme copy that out of order insertion would require. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Brain2000 opened a new issue, #12082: LeafFieldComparator setBottom not being called before compareBottom
Brain2000 opened a new issue, #12082: URL: https://github.com/apache/lucene/issues/12082 ### Description It looks like there's a problem in the TopFieldCollector.java where it calls "compareBottom" without calling "setBottom" first. I believe this is an issue if getLeafComparer( ) is called more than once, as the new class will no longer have the previous bottom slot value. I found a workaround is to rescope the variable storing the bottom slot set from calling "setBottom( )", to outside of the LeafFieldComparator class to make it persistent (just like the setTopValue( ) is called outside of the LeafFieldComparator class). Rescoping the bottom slot should be safe unless multiple threads are being run at the same time and need their own version of bottomSlot for the LeafFieldComparator class. The documentation here https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/LeafFieldComparator.html states that _"[setBottom] will always be called before compareBottom(int)"_ So one might think the LeafFieldComparator would be used like this: ``` //get new leaf comparator LeafFieldComparator lc = getLeafComparator( ); //setBottom before compareBottom lc.setBottom(xxx); lc.compareBottom(yyy); //get new leaf comparator lc = getLeafComparator( ); //setBottom before compareBottom lc.setBottom(xxx2); lc.compareBottom(yyy2); ``` However, it is being called in this fashion: ``` //get new leaf comparator LeafFieldComparator lc = getLeafComparator( ); //setBottom before compareBottom lc.setBottom(xxx); lc.compareBottom(yyy); //get new leaf comparator LeafFieldComparator lc = getLeafComparator( ); //compareBottom... hopefully the slot is correct... lc.compareBottom(yyy); ``` As you can see, the 2nd set of code does not treat the LeafComparator as if each instance has it's own bottomSlot variable. I believe the purpose of this would be to allow multiple threads to call this simultaneously, which would require each one to have it's own bottomSlot, otherwise it would not be thread safe. Also, if setBottom was meant to be outside of the scope of LeafComparator, then why was the time taken to put setTop outside and setBottom inside? This seems to have been done very deliberately according to the documentation. Here is a sample code snippet: ``` import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.search.FieldComparator; import org.apache.lucene.search.LeafFieldComparator; import org.apache.lucene.search.Scorable; public final class MySimpleStringComparator extends FieldComparator { private String[] values; private final String field; private String topValue; //SCOPING bottomSlot HERE WILL CAUSE ITEMS TO RETURN PROPERLY //private int bottomSlot; public MySimpleStringComparator(int numHits, String field) { this.values = new String[numHits]; this.field = field; } @Override public int compare(int slot1, int slot2) { return this.compareValues(this.values[slot1], this.values[slot2]); } @Override public LeafFieldComparator getLeafComparator(LeafReaderContext lrc) throws IOException { final LeafReaderContext frc = lrc; return new LeafFieldComparator() { //SCOPING bottomSlot HERE WILL CAUSE ITEMS ON A SUBSET TO RANDOMLY DISAPPEAR private int bottomSlot; public void setScorer(Scorable arg0) throws IOException { } public void setBottom(int slot) throws IOException { bottomSlot = slot; } public void copy(int slot, int doc) throws IOException { values[slot] = frc.reader().document(doc).get(field); } public int compareTop(int doc) throws IOException { return compareValues(topValue, frc.reader().document(doc).get(field)); } public int compareBottom(int doc) throws IOException { return compareValues(values[bottomSlot], frc.reader().document(doc).get(field)); } }; } @Override public void setTopValue(String val) { this.topValue = val; } @Override public String value(int slot) { return this.values[slot]; } } ``` Here is a code snippet that runs a search/sort for a single page that utilizes the MySimpleStringComparator above. ``` ComplexPhraseQueryParser cpqp
[GitHub] [lucene] jpountz commented on pull request #11807: No need to rewrite queries in unified highlighter
jpountz commented on PR #11807: URL: https://github.com/apache/lucene/pull/11807#issuecomment-1380775336 @romseygeek Should it be backported to branch_9x? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11779: GITHUB#11778: Add detailed part-of-speech tag for particle and ending on Nori
jpountz commented on PR #11779: URL: https://github.com/apache/lucene/pull/11779#issuecomment-1380774395 @danmuzi Should this be backported to branch_9x? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12081: Speed up DocIdMerger on sorted indexes.
jpountz merged PR #12081: URL: https://github.com/apache/lucene/pull/12081 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12079: Speed up 1D BKD merging.
jpountz merged PR #12079: URL: https://github.com/apache/lucene/pull/12079 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12072: Fix exponential runtime for Boolean#rewrite
javanna commented on PR #12072: URL: https://github.com/apache/lucene/pull/12072#issuecomment-1380596699 Thanks @benwtrent ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna closed issue #12069: Long rewrite times for deeply nested, non-scoring Boolean queries
javanna closed issue #12069: Long rewrite times for deeply nested, non-scoring Boolean queries URL: https://github.com/apache/lucene/issues/12069 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna merged pull request #12072: Fix exponential runtime for Boolean#rewrite
javanna merged PR #12072: URL: https://github.com/apache/lucene/pull/12072 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12078: Enhance XXXField#newRangeQuery
jpountz commented on code in PR #12078: URL: https://github.com/apache/lucene/pull/12078#discussion_r1068260998 ## lucene/core/src/java/org/apache/lucene/document/LongField.java: ## @@ -108,8 +109,9 @@ public static Query newExactQuery(String field, long value) { */ public static Query newRangeQuery(String field, long lowerValue, long upperValue) { PointRangeQuery.checkArgs(field, lowerValue, upperValue); +Query fallbackQuery = LongPoint.newRangeQuery(field, lowerValue, upperValue); return new IndexOrDocValuesQuery( -LongPoint.newRangeQuery(field, lowerValue, upperValue), +new IndexSortSortedNumericDocValuesRangeQuery(field, lowerValue, upperValue, fallbackQuery), SortedNumericDocValuesField.newSlowRangeQuery(field, lowerValue, upperValue)); Review Comment: I would have wrapped the IndexOrDocValuesQuery within the IndexSortSortedNumericDocValuesRangeQuery instead. IndexOrDocValuesQuery is needed because the point query is bad at skipping and the doc values query is bad at iterating over all matches, but when an index sort is configured, `IndexSortSortedNumericDocValuesRangeQuery` is good at both so it should take precedence over both the index and the doc-values query? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dantuzi commented on pull request #12048: Move HNSW parameters to the HnswGraphBuilder class
dantuzi commented on PR #12048: URL: https://github.com/apache/lucene/pull/12048#issuecomment-1380490660 @msokolov I'm finalizing another PR that works with HnswGraph and, to create the graph, I use the constants `DEFAULT_MAX_CONN` and `DEFAULT_BEAM_WIDTH` already defined. So, in the future, these constants will also be used outside the class `HnswVectorsFormat`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068156278 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: and obviously, fixing this can be a followup to this PR. but we should do it before releasing the new APIs. These new fields are supposed to be easy to use, so they should support storing as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068154625 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: look at current stored fields writer as example. All these damn java abstractions, yet our codec writer is doing **TYPE-GUESSING?**. Let's add a new method, so the codec knows the type and never guesses. This "guessing" belongs as an impl detail behind a new method in Field.java IMO (because Field.java tries to be a superhero and support all types). For structured types like KeywordField it should just be `return STRING`. ``` Number number = field.numericValue(); if (number != null) { if (number instanceof Byte || number instanceof Short || number instanceof Integer) { bits = NUMERIC_INT; } else if (number instanceof Long) { bits = NUMERIC_LONG; } else if (number instanceof Float) { bits = NUMERIC_FLOAT; } else if (number instanceof Double) { bits = NUMERIC_DOUBLE; } else { throw new IllegalArgumentException("cannot store numeric type " + number.getClass()); } string = null; bytes = null; } else { bytes = field.binaryValue(); if (bytes != null) { bits = BYTE_ARR; string = null; } else { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068145212 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: by self-created i mean, too many java abstractions / java abstractions are the things causing the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068144382 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: Right, its a self-created problem though, because we know it should go into storedfields as a string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.
jpountz commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068141063 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: I don't mind killing the BytesRef ctor, but I don't think it would be enough. We need this field to implement `binaryValue()` so that doc values can be indexed. But then stored fields are going to see a field where both `stringValue()` and `binaryValue()` return non-null, and it would be a problem for the current stored fields format which checks the binary value first, so this `KeywordField` would be considered as a binary field by stored fields. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12081: Speed up DocIdMerger on sorted indexes.
jpountz commented on PR #12081: URL: https://github.com/apache/lucene/pull/12081#issuecomment-1380369577 Here are timings of the first doc value merges on the IndexTaxis benchmark and a sorted dense index: ``` SM 0 [2023-01-12T13:17:47.581987785Z; Thread-0]: 564 ms to merge doc values [367596 docs] SM 0 [2023-01-12T13:17:52.428441520Z; Thread-0]: 392 ms to merge doc values [365886 docs] SM 0 [2023-01-12T13:17:57.028935759Z; Thread-0]: 377 ms to merge doc values [365212 docs] SM 0 [2023-01-12T13:18:01.642897130Z; Thread-0]: 379 ms to merge doc values [367056 docs] SM 0 [2023-01-12T13:18:06.191013759Z; Thread-0]: 368 ms to merge doc values [364343 docs] SM 0 [2023-01-12T13:18:10.726133213Z; Thread-0]: 367 ms to merge doc values [363438 docs] SM 0 [2023-01-12T13:18:15.299072784Z; Thread-0]: 374 ms to merge doc values [366954 docs] SM 0 [2023-01-12T13:18:19.907346178Z; Thread-0]: 370 ms to merge doc values [365804 docs] SM 0 [2023-01-12T13:18:24.545160224Z; Thread-0]: 371 ms to merge doc values [366997 docs] SM 0 [2023-01-12T13:18:29.163147828Z; Thread-0]: 374 ms to merge doc values [367039 docs] SM 0 [2023-01-12T13:18:34.031698013Z; Thread-0]: 3709 ms to merge doc values [3660325 docs] SM 0 [2023-01-12T13:18:41.003270469Z; Thread-0]: 369 ms to merge doc values [365722 docs] SM 0 [2023-01-12T13:18:45.598362451Z; Thread-0]: 369 ms to merge doc values [365728 docs] SM 0 [2023-01-12T13:18:50.243855116Z; Thread-0]: 376 ms to merge doc values [367656 docs] SM 0 [2023-01-12T13:18:54.862259612Z; Thread-0]: 368 ms to merge doc values [365076 docs] SM 0 [2023-01-12T13:18:59.503139644Z; Thread-0]: 372 ms to merge doc values [365477 docs] SM 0 [2023-01-12T13:19:04.151794532Z; Thread-0]: 374 ms to merge doc values [367064 docs] SM 0 [2023-01-12T13:19:08.810873192Z; Thread-0]: 373 ms to merge doc values [367412 docs] SM 0 [2023-01-12T13:19:13.460733970Z; Thread-0]: 371 ms to merge doc values [366363 docs] SM 0 [2023-01-12T13:19:18.065982734Z; Thread-0]: 384 ms to merge doc values [366018 docs] SM 0 [2023-01-12T13:19:22.840259876Z; Thread-0]: 384 ms to merge doc values [365759 docs] SM 0 [2023-01-12T13:19:27.916927516Z; Thread-0]: 3842 ms to merge doc values [3662275 docs] SM 0 [2023-01-12T13:19:35.187165246Z; Thread-0]: 386 ms to merge doc values [366840 docs] ``` Now the same log with the change for the same set of merges: ``` SM 0 [2023-01-12T13:27:49.359139222Z; Thread-0]: 482 ms to merge doc values [367596 docs] SM 0 [2023-01-12T13:27:54.237137785Z; Thread-0]: 380 ms to merge doc values [365886 docs] SM 0 [2023-01-12T13:27:59.001734090Z; Thread-0]: 371 ms to merge doc values [365212 docs] SM 0 [2023-01-12T13:28:03.713076713Z; Thread-0]: 365 ms to merge doc values [367056 docs] SM 0 [2023-01-12T13:28:08.355025029Z; Thread-0]: 352 ms to merge doc values [364343 docs] SM 0 [2023-01-12T13:28:13.032506866Z; Thread-0]: 352 ms to merge doc values [363438 docs] SM 0 [2023-01-12T13:28:17.722795068Z; Thread-0]: 354 ms to merge doc values [366954 docs] SM 0 [2023-01-12T13:28:22.468102730Z; Thread-0]: 367 ms to merge doc values [365804 docs] SM 0 [2023-01-12T13:28:27.314385233Z; Thread-0]: 363 ms to merge doc values [366997 docs] SM 0 [2023-01-12T13:28:32.162119724Z; Thread-0]: 366 ms to merge doc values [367039 docs] SM 0 [2023-01-12T13:28:37.027154229Z; Thread-0]: 3678 ms to merge doc values [3660325 docs] SM 0 [2023-01-12T13:28:44.367846338Z; Thread-0]: 364 ms to merge doc values [365722 docs] SM 0 [2023-01-12T13:28:49.208172350Z; Thread-0]: 365 ms to merge doc values [365728 docs] SM 0 [2023-01-12T13:28:54.096854303Z; Thread-0]: 367 ms to merge doc values [367656 docs] SM 0 [2023-01-12T13:28:58.937724069Z; Thread-0]: 355 ms to merge doc values [365076 docs] SM 0 [2023-01-12T13:29:03.645241892Z; Thread-0]: 350 ms to merge doc values [365477 docs] SM 0 [2023-01-12T13:29:08.355256074Z; Thread-0]: 355 ms to merge doc values [367064 docs] SM 0 [2023-01-12T13:29:13.091695689Z; Thread-0]: 353 ms to merge doc values [367412 docs] SM 0 [2023-01-12T13:29:17.813632707Z; Thread-0]: 353 ms to merge doc values [366363 docs] SM 0 [2023-01-12T13:29:22.522670731Z; Thread-0]: 355 ms to merge doc values [366018 docs] SM 0 [2023-01-12T13:29:27.252547913Z; Thread-0]: 357 ms to merge doc values [365759 docs] SM 0 [2023-01-12T13:29:32.005065809Z; Thread-0]: 3580 ms to merge doc values [3662275 docs] SM 0 [2023-01-12T13:29:39.224502172Z; Thread-0]: 359 ms to merge doc values [366840 docs] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12081: Speed up DocIdMerger on sorted indexes.
jpountz opened a new pull request, #12081: URL: https://github.com/apache/lucene/pull/12081 In the case when an index is sorted on a low-cardinality field, or the index sort order correlates with the order in which documents get ingested, we can optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on the next sub. This checks covers at the same time whether the priority queue needs reordering and whether the current sub reached `NO_MORE_DOCS`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mmatela opened a new issue, #12080: SynonymGraphFilter: wrong output token position when input positions overlap
mmatela opened a new issue, #12080: URL: https://github.com/apache/lucene/issues/12080 ### Description In my example, the query is 'test polskie'. I use MorfologikFilter for Polish stemming, it turns 'polskie' into 'polski' + 'polskie'. I also use SynonymGraphFilter which turns 'polski' into 'pol'. It's applied **only for query**. Here's what I see in quey analysis (token position in parenthesis): ``` Tokenizer: test(1) polskie(2) MF: test(1) polskie(2) polski(2) SGF: test(1) polskie(2) pol(3) polski(3). ``` When I search for "test polskie" with quotation marks, a document with the same text doesn't match, because SGF changes positions of tokens in query compared to index. In documentation, the description for the old `SynonymFilter` says "_The position value of the new tokens are set such they all occur at the same position as the original token._" In `SynonymGraphFilter` instead they are set to a position after the previous token. Is that an intentional change? Doesn't seem so, because it doesn't work as expected in my example. Looking at the code, it seems the problem is in https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilter.java#L246: `nextNodeOut = lastNodeOut + posLenAtt.getPositionLength();` `nextNodeOut` is always set as the position after the current token, and that is later used as position of output token. I tried to remove this line and instead set this field right after the call to `input.incrementToken()`, in line 340: `nextNodeOut = lastNodeOut + posIncrAtt.getPositionIncrement();` This sets it to the original token's position. This way the final positions are `SGF: test(1) polskie(2) pol(2) polski(2).` and my document does match. I didn't experience any unexpected side effects. Hope this helps. I'm not familiar with the project enough to easilly submit a proper pull request, with tests and all. ### Version and environment details lucene 9.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068105620 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: as far as the LongField etc, we should deal with that in another issue. I agree it should support Field.Store.YES/NO. The things are you speak of are not barriers to that. They are self-created problems that we can fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068103957 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: Whoah... this field is for strings. Kill the BytesRef constructor. It is enough to pass a String. and you can support Field.Store.YES/NO -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12054: Introduce a new `KeywordField`.
jpountz commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068102375 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: Right. The challenge I'm seeing is that to index both points and doc values on numeric fields, we're creating a field that produces a binaryValue() consumed by points, as well as a numeric value consumed by doc values. But stored fields can store both binary and numeric data, so how should they know which value they should look at? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12079: Speed up 1D BKD merging.
jpountz commented on PR #12079: URL: https://github.com/apache/lucene/pull/12079#issuecomment-1380286154 I remember thinking about it, and there are legitimate use-cases for `Arrays#compareUnsigned` like `BytesRef#compareTo`. Another thing is that `ArrayUtil#getUnsignedComparator` only helps if we expect the lengths to compare to be commonly 4 or 8, otherwise it just wraps `Arrays#compareUnsigned`. But to your point, maybe we could do another round or review of all call sites or `Arrays#compareUnsigned` to see if there are some of them that should switch to `ArrayUtil#getUnsignedComparator`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12054: Introduce a new `KeywordField`.
rmuir commented on code in PR #12054: URL: https://github.com/apache/lucene/pull/12054#discussion_r1068060646 ## lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java: ## @@ -234,8 +234,8 @@ void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOExcepti // field that is indexed (i.e. searchable), but don't tokenize // the field into separate words and don't index term frequency // or positional information: - Field pathField = new StringField("path", file.toString(), Field.Store.YES); - doc.add(pathField); + doc.add(new KeywordField("path", file.toString())); + doc.add(new StoredField("path", file.toString())); Review Comment: > If you also need to store the value, you should add a separate {@link StoredField} instance. Let's rethink this for the new fields we are adding. I think storing is quite common and offering Field.Store.YES/NO is the best choice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on pull request #12079: Speed up 1D BKD merging.
iverase commented on PR #12079: URL: https://github.com/apache/lucene/pull/12079#issuecomment-1380193686 I wonder if we should add `Arrays.compareUnsigned` to forbidden APIs to force always to use the faster comparators. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12072: Fix exponential runtime for Boolean#rewrite
javanna commented on PR #12072: URL: https://github.com/apache/lucene/pull/12072#issuecomment-1380141646 @benwtrent could you add a changelog entry too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek commented on issue #10458: Ordered intervals can give inaccurate hits on interleaved terms [LUCENE-9418]
romseygeek commented on issue #10458: URL: https://github.com/apache/lucene/issues/10458#issuecomment-1380139574 Hi @Brain2000, yes please open a separate issue. If you could include a reproduction then we can work out if it's an issue with TopFieldCollector itself, or with how you're using it and if we need to improve the API documentation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #12079: Speed up 1D BKD merging.
jpountz opened a new pull request, #12079: URL: https://github.com/apache/lucene/pull/12079 On the NYC taxis dataset on my local machine, switching from `Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15% speedup of BKD merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12072: Fix exponential runtime for Boolean#rewrite
jpountz commented on code in PR #12072: URL: https://github.com/apache/lucene/pull/12072#discussion_r1067925752 ## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: ## @@ -203,9 +203,18 @@ BooleanQuery rewriteNoScoring(IndexSearcher indexSearcher) throws IOException { for (BooleanClause clause : clauses) { Query query = clause.getQuery(); - Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher); - if (rewritten instanceof ConstantScoreQuery) { -rewritten = ((ConstantScoreQuery) rewritten).getQuery(); + // NOTE: rewritingNoScoring() should not call rewrite(), otherwise this + // method could run in exponential time with the depth of the query as + // every new level would rewrite 2x more than its parent level. + Query rewritten = query; + if (query instanceof BoostQuery) { +rewritten = ((BoostQuery) query).getQuery(); + } + if (query instanceof ConstantScoreQuery) { +rewritten = ((ConstantScoreQuery) query).getQuery(); + } + if (query instanceof BooleanQuery) { +rewritten = ((BooleanQuery) query).rewriteNoScoring(indexSearcher); Review Comment: Sorry for the confusion, my suggestion should have kept the call to rewriteNoScoring on boolean queries, otherwise we're missing simplifications on inner queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12053: Allow reusing indexed binary fields.
jpountz merged PR #12053: URL: https://github.com/apache/lucene/pull/12053 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12053: Allow reusing indexed binary fields.
jpountz commented on PR #12053: URL: https://github.com/apache/lucene/pull/12053#issuecomment-1379974864 Thanks @rmuir ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org