Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2023-12-08 Thread via GitHub
dungba88 commented on code in PR #12831: URL: https://github.com/apache/lucene/pull/12831#discussion_r1420866991 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -503,9 +518,7 @@ public FSTMetadata getMetadata() { } /** - * Save the FST to DataOutput.

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848198927 I am travelling this weekend and unlikely to make much progress on this until early next week. Maybe we just revert and release 9.9.1 now? -- This is an automated messag

Re: [PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]

2023-12-08 Thread via GitHub
msfroh commented on PR #12897: URL: https://github.com/apache/lucene/pull/12897#issuecomment-1848071583 If needed, I'm happy to add versions of `testFileIsUTF8()` for the other SimpleTextCodec format unit tests.

[PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]

2023-12-08 Thread via GitHub
msfroh opened a new pull request, #12897: URL: https://github.com/apache/lucene/pull/12897 ### Description The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a resul

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848037016 I think if a fix for this isn't found early next week, we should consider reverting it. No user should upgrade to Lucene 9.9.0 with this bug.

Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub
kuramitsu commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421180794 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStrea

Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub
kuramitsu commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421172943 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStrea

Re: [I] Reproducible TestDrillSideways failure [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on issue #12418: URL: https://github.com/apache/lucene/issues/12418#issuecomment-1847976644 OK, merged #12853, which I think fixes the root cause of these randomized test failures. I'm going to resolve this issue and will keep an eye on nightlies for any new failures.

Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1847976273 Fixed the root cause of this in #12853

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on PR #12853: URL: https://github.com/apache/lucene/pull/12853#issuecomment-1847976009 Thanks @gautamworah96 for taking a look!

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421103992 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOE

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub
gsmiller merged PR #12853: URL: https://github.com/apache/lucene/pull/12853

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub
gautamworah96 commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421069686 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throw

Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub
gautamworah96 commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421077089 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throw

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847858897 @mikemccand I have to use at a minimum: `wikibig1m` for it to replicate. Couple of weird things I noticed in that optimization PR: - https://github.com/apache/lucene

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub
zhaih commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847830605 Oh probably not, the test is just using the default merge policy (TMP)

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub
zhaih commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847828807 I think it might be due to the same problem as https://github.com/apache/lucene/pull/12889, e.g. a doc-reordering merge policy reordered the parent-child block. I haven't checked it myse

Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub
zhaih commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421011978 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream in

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847786318 It's also curious that it's not happening w/ 9.9 created indices. #12699 is about optimizing how we accumulate the long output while traversing (reading) the FST block tree term

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847782193 Ugh -- I'll try to look at this later today. Disappointing that our back compat test specifically for reading 9.8 indices failed to catch this.

[I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub
gsmiller opened a new issue, #12896: URL: https://github.com/apache/lucene/issues/12896 ### Description Saw this test fail a couple times in automated builds (e.g., [here](https://jenkins.thetaphi.de/job/Lucene-main-Windows/13501/testReport/junit/org.apache.lucene.search.join/TestPare

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
jpountz commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604281 I have a 9.8 index that reproduces the bug and ran a `git bisect` to figure out the first commit that fails; it pointed to #12699.

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604189 Git bisect has confirmed the read corruption occurs with: https://github.com/apache/lucene/pull/12699

Re: [PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub
gsmiller merged PR #12894: URL: https://github.com/apache/lucene/pull/12894

Re: [PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12894: URL: https://github.com/apache/lucene/pull/12894#discussion_r1420835887 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java: ## @@ -283,14 +283,17 @@ public FST doTest() throws IOException { } }

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420843242 ## lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java: ## @@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception { }

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847590622 Possibly related: https://github.com/apache/lucene/pull/12631 NOTE: the read corruption doesn't occur when reading from an index created in 9.9.

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
gsmiller merged PR #12812: URL: https://github.com/apache/lucene/pull/12812

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420830252 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super

Re: [PR] Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated [lucene]

2023-12-08 Thread via GitHub
gsmiller merged PR #12854: URL: https://github.com/apache/lucene/pull/12854

Re: [PR] Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook [lucene]

2023-12-08 Thread via GitHub
gsmiller merged PR #12855: URL: https://github.com/apache/lucene/pull/12855

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847568572 Here are some exceptions I ran into when trying to run multi-term queries with Lucene 9.9 against an index created in 9.8 or earlier: ``` Caused by: java.lang.ArrayIndexOutOf

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847562169 //cc @gf2121 && @mikemccand

[I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub
benwtrent opened a new issue, #12895: URL: https://github.com/apache/lucene/issues/12895 ### Description It seems that https://github.com/apache/lucene/pull/12699/ has inadvertently broken reading term dictionaries created in Lucene 9.8 or earlier. To replicate the bug, one can index wiki

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420812950 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOEx

[PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub
dungba88 opened a new pull request, #12894: URL: https://github.com/apache/lucene/pull/12894 ### Description The test can throw an NPE when it's using off-heap mode and no nodes are accepted

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on PR #12862: URL: https://github.com/apache/lucene/pull/12862#issuecomment-1847540870 Thank you for reviewing, @mikemccand! Resolved your comments in the 2nd commit.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420760926 ## lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java: ## @@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception { }

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420793976 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws I

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420792687 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420778257 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { s

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764333 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764008 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763643 ## lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java: ## @@ -568,6 +568,12 @@ public Number getSpecificValue(String dim, String... path) {

Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763238 ## lucene/CHANGES.txt: ## @@ -67,6 +67,8 @@ API Changes * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param the default behaviour. (Jakub

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420739649 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420565417 ## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOExcept

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-12-08 Thread via GitHub
benwtrent commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1420733099 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public long

Re: [PR] [WIP] LUCENE-10002: Deprecate FacetsCollector#search helper methods as they internally use IndexSearcher#search(Query, Collector) API [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on PR #12890: URL: https://github.com/apache/lucene/pull/12890#issuecomment-1847461557 IMO we should deprecate these without replacement. I agree that users should be able to implement this logic pretty easily in their application layer, and would probably be better suite

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420723711 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { s

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420707913 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420699163 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { s

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420685472 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super

Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

2023-12-08 Thread via GitHub
lukas-vlcek commented on PR #12875: URL: https://github.com/apache/lucene/pull/12875#issuecomment-1847241112 @mikemccand Could you give me a hint about this? > (e.g. `UnifiedHighlighter`, in certain modes) I am looking at `TestUnifiedHighlighter*` tests. Does it mean th

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420450756 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420451580 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420447123 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420418454 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420387856 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException { }

[PR] Removing @lucene.experimental tags in testXXX methods in CheckIndex [lucene]

2023-12-08 Thread via GitHub
slow-J opened a new pull request, #12893: URL: https://github.com/apache/lucene/pull/12893 Following up on @mikemccand's comment in previous CheckIndex PR:https://github.com/apache/lucene/pull/12876. > I do think some of these tags in CheckIndex.java could be removed, e.g. on each o

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847102122 It looks good on `byteBuffers` and `MMapDirectory`: the benchmark result is pretty close to the previous commit, but there is a slight slowdown on `NIOFSDirectory`; I will dig into it. * `*ReadGroupV

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420382794 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException { }

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420379683 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException { }

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420377148 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws I

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420375351 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int numBytesMinus1) thro

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420373977 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int numBytesMinus1) thro

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420364348 ## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOExc

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847060669 I'm measuring the performance difference against the previous commit; it will take a moment.

Re: [PR] Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) [lucene]

2023-12-08 Thread via GitHub
gokaai commented on code in PR #12872: URL: https://github.com/apache/lucene/pull/12872#discussion_r1420277534 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( SegmentReader reader = null;

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846907147 > Can we do the same for all other inputs? I think so; I will do this if @jpountz doesn't mind. > I will nag Maurizio again about the problem with slice(). Thank you s

Re: [PR] Correct last remaining instances of typo e.g. "Levenstein" -> "Levenshtein" [lucene]

2023-12-08 Thread via GitHub
shaikhu commented on PR #12519: URL: https://github.com/apache/lucene/pull/12519#issuecomment-1846884628 Oops I completely forgot about this. Restored forked repo and reopening.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846883907 I will nag Maurizio again about the problem with slice(). The reason for this was some strange problem with Hotspot. I thought they fixed it.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846881254 I would still be safe and initialize the IntReader on construction of the IndexInput. It can strongly bind to the current segment. Can we do the same for all other inputs?

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846784038 +1 for gc overhead, here is the gc output (`-prof gc` ): ``` Benchmark (size) Mode Cnt Score Er

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846779641 I confirmed there's GC activity happening with the slice approach by using `-prof gc`: ``` Benchmark (si

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846774383 I'll check if there is GC activity during the benchmark. In the meantime, I looked into using lambdas instead, and it seems like it would work well: https://github.com/apache/lucene/comm

Re: [PR] [Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub
kuramitsu commented on PR #12885: URL: https://github.com/apache/lucene/pull/12885#issuecomment-1846729364 The modification within the getRomanization function has been dropped. Instead, in the incrementToken function, I added a process to treat the hiragana OOV term converted to kataka a