[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-18 Thread GitBox


mayya-sharipova commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r949691151


##
lucene/core/src/java/org/apache/lucene/index/VectorValues.java:
##
@@ -192,36 +176,5 @@ public int advance(int target) throws IOException {
 public long cost() {
   return size();
 }
-
-@Override
-public RandomAccessVectorValues randomAccess() throws IOException {

Review Comment:
   Very nice simplification!



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -133,7 +132,7 @@ private HnswGraphBuilder(
* accessor for the vectors
*/
   public OnHeapHnswGraph build(RandomAccessVectorValues vectors) throws IOException {
-if (vectors == vectorValues) {
+if (vectors == this.vectors) {

Review Comment:
   Maybe give the function parameter a name other than `vectors` (e.g. 
`pvectors`); otherwise it is easy to confuse with `this.vectors`.
The same applies to the `addVectors` function.
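
To illustrate the naming concern raised in this review comment, here is a minimal sketch. The class and method names are hypothetical stand-ins and are not taken from the Lucene source; the only point is how a parameter named like a field forces the reader to track the `this.` qualifier.

```java
import java.util.List;

// Hypothetical stand-in for a builder that keeps a field and also accepts a
// parameter of the same type; names are illustrative, not Lucene's.
final class GraphBuilderSketch {
  private final List<float[]> vectors;

  GraphBuilderSketch(List<float[]> vectors) {
    this.vectors = vectors;
  }

  // The parameter shadows the field: a bare `vectors` here is the argument,
  // and only `this.vectors` reaches the field.
  boolean sameSourceShadowed(List<float[]> vectors) {
    return vectors == this.vectors;
  }

  // A distinct parameter name: no qualifier is needed to tell the two apart.
  boolean sameSourceRenamed(List<float[]> vectorsToAdd) {
    return vectorsToAdd == vectors;
  }

  public static void main(String[] args) {
    List<float[]> own = List.of(new float[] {1f, 2f});
    List<float[]> other = List.of(new float[] {1f, 2f});
    GraphBuilderSketch builder = new GraphBuilderSketch(own);
    System.out.println(builder.sameSourceShadowed(own));
    System.out.println(builder.sameSourceRenamed(other));
  }
}
```

Both methods behave identically; the renamed variant simply removes the chance of dropping `this.` by accident and comparing the argument with itself.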



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif resolved LUCENE-10681.
-
Resolution: Duplicate

Seems to be a duplicate of LUCENE-8118.

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Major
>
> Hello,
> I looked for a similar issue but didn't find one, so I'm creating this one; 
> sorry if it was reported before. We recently upgraded from Lucene 5.5.5 to 
> 9.2.0, and a user reported the error below while indexing a huge binary file 
> in a parent-children schema, where strings extracted from the huge binary file 
> (using the strings command) are indexed as thousands of ~10MB child text docs 
> of the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
>     at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]
> {noformat}
>  
> This looks like an integer overflow to me, though I'm not sure. It didn't 
> happen with the previous lucene-5.5.5, and indexing files like this is pretty 
> common for us. With lucene-5.5.5, however, we used to split the huge file 
> manually before indexing and call IndexWriter.addDocument(Document) once for 
> each 10MB chunk; with lucene-9.2.0 we now use the 
> IndexWriter.addDocuments(Iterable) method. Any thoughts?
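
The reporter's overflow hypothesis can be illustrated with plain int arithmetic. The sketch below is not Lucene code: the block shift of 15 (32 KB blocks) and all names are assumptions made for the example. It only shows how a signed 32-bit offset that wraps around can turn into a block index of -65536, matching the index in the exception; it does not establish the actual root cause.

```java
// Illustrative only: BLOCK_SHIFT = 15 (32 KB blocks) is an assumption made for
// this sketch, and the names do not come from the Lucene source.
final class OverflowSketch {
  static final int BLOCK_SHIFT = 15;

  // Maps a global byte offset to the index of the block that holds it.
  static int blockIndex(int globalOffset) {
    return globalOffset >> BLOCK_SHIFT;
  }

  public static void main(String[] args) {
    int offset = Integer.MAX_VALUE;          // last offset an int can address
    System.out.println(blockIndex(offset));  // 65535
    offset += 1;                             // wraps to Integer.MIN_VALUE
    System.out.println(blockIndex(offset));  // -65536
  }
}
```

Under these assumptions, the negative index appears exactly one byte past the 2 GB mark of a single int-addressed buffer.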



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581478#comment-17581478
 ] 

Luís Filipe Nassif edited comment on LUCENE-8118 at 8/18/22 6:37 PM:
-

Hi, a colleague of mine pointed this out to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as a duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a known upper bound for numDocs x docSize given to 
addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

PS2: our default commit time interval is 30min

PS3: I changed our application from addDocument() to addDocumentS() partly 
because of the nice atomic guarantees and partly because all text chunks must 
be children of one parent document. If we have to call addDocumentS() 
multiple times with smaller iterables, we will possibly have to implement the 
parent-children control ourselves (as we did in the past with the first 
method)... or not?
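
The "smaller iterables" option mentioned in PS3 could be sketched as follows. This is a hypothetical helper, not a Lucene API: the byte budget is an arbitrary parameter chosen by the caller, and nothing here reflects a documented limit of addDocuments().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups chunk sizes (in bytes) into consecutive batches
// whose total stays at or below maxBatchBytes. A chunk larger than the budget
// gets a batch of its own.
final class BatchSketch {
  static List<List<Long>> batches(List<Long> chunkSizes, long maxBatchBytes) {
    List<List<Long>> out = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long total = 0;
    for (long size : chunkSizes) {
      // Close the current batch before it would exceed the budget.
      if (!current.isEmpty() && total + size > maxBatchBytes) {
        out.add(current);
        current = new ArrayList<>();
        total = 0;
      }
      current.add(size);
      total += size;
    }
    if (!current.isEmpty()) {
      out.add(current);
    }
    return out;
  }

  public static void main(String[] args) {
    long tenMb = 10L * 1024 * 1024;
    List<Long> chunks = List.of(tenMb, tenMb, tenMb, tenMb, tenMb);
    // With a 25 MB budget, five 10 MB chunks are split into three batches.
    System.out.println(batches(chunks, 25L * 1024 * 1024).size());
  }
}
```

Each batch would then go into its own call, which gives up atomicity across the whole parent and is why the parent-children bookkeeping would move back into the application, as the comment notes.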


was (Author: lfcnassif):
Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a known upper bound for numDocs x docSize given to 
addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

PS2: our default commit time interval is 30min

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2
> Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>Reporter: Laura Dietz
>Priority: Major
> Attachments: LUCENE-8118_test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte  (full stack trace 
> below). 
> The bug is possibly related to issues described in 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
>   and [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I 
> am not using SOLR, I am directly using Lucene Core.
> The issue can be reproduced using code from  [GitHub 
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>  
> - compile with `mvn compile assembly:single`
> - run with `java -cp 
> ./target/treccar-tools-example-0.1-jar-with-dependencies.jar 
> edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>     at

[jira] [Comment Edited] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581478#comment-17581478
 ] 

Luís Filipe Nassif edited comment on LUCENE-8118 at 8/18/22 6:24 PM:
-

Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a known upper bound for numDocs x docSize given to 
addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

PS2: our default commit time interval is 30min


was (Author: lfcnassif):
Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docSize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

PS2: our default commit time interval is 30min


[jira] [Comment Edited] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581478#comment-17581478
 ] 

Luís Filipe Nassif edited comment on LUCENE-8118 at 8/18/22 6:19 PM:
-

Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docSize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

PS2: our default commit time interval is 30min


was (Author: lfcnassif):
Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docSize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads


[jira] [Comment Edited] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581478#comment-17581478
 ] 

Luís Filipe Nassif edited comment on LUCENE-8118 at 8/18/22 6:17 PM:
-

Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docSize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads


was (Author: lfcnassif):
Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docDize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581478#comment-17581478
 ] 

Luís Filipe Nassif commented on LUCENE-8118:


Hi, a colleague of mine pointed this to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB 
sized docs. Is there a reasonable numDocs x docSize limit for addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other 
threads

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2
> Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>Reporter: Laura Dietz
>Priority: Major
> Attachments: LUCENE-8118_test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte  (full stack trace 
> below). 
> The bug is possibly related to issues described in 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
>   and [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I 
> am not using SOLR, I am directly using Lucene Core.
> The issue can be reproduced using code from  [GitHub 
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>  
> - compile with `mvn compile assembly:single`
> - run with `java -cp 
> ./target/treccar-tools-example-0.1-jar-with-dependencies.jar 
> edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)

[jira] [Resolved] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10644.
--
Fix Version/s: 9.4
   Resolution: Fixed

> Facets#getAllChildren testing should ignore child order
> ---
>
> Key: LUCENE-10644
> URL: https://issues.apache.org/jira/browse/LUCENE-10644
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.4
>
> Attachments: failing tests.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers 
> should make no assumptions about child ordering, but a number of our own unit 
> tests turn around and make that assumption. I ran into this when recently 
> trying an optimization that would result in a different child ordering for 
> {{getAllChildren}}, and found a number of unit tests that started 
> failing. I'll upload a list of what I found failing.
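
One way a test can avoid depending on child order is to normalize both sides before comparing. The sketch below is self-contained and deliberately does not use Lucene's facet classes: label/count pairs are modeled as `Map.Entry<String, Integer>`, which is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration with plain label/count pairs; no Lucene types are involved.
final class UnorderedChildrenSketch {
  // Returns a copy sorted by label so that two results can be compared
  // without relying on the order in which the children were returned.
  static List<Map.Entry<String, Integer>> sortedByLabel(
      List<Map.Entry<String, Integer>> children) {
    List<Map.Entry<String, Integer>> copy = new ArrayList<>(children);
    copy.sort(Map.Entry.comparingByKey());
    return copy;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> expected =
        List.of(Map.entry("a", 1), Map.entry("b", 2));
    List<Map.Entry<String, Integer>> actual =
        List.of(Map.entry("b", 2), Map.entry("a", 1));
    // Order-sensitive comparison fails, normalized comparison succeeds.
    System.out.println(expected.equals(actual));
    System.out.println(sortedByLabel(expected).equals(sortedByLabel(actual)));
  }
}
```

Sorting by label assumes labels are unique among the children of one parent; a test with duplicate labels would need a secondary sort key.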






[jira] [Commented] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581475#comment-17581475
 ] 

ASF subversion and git services commented on LUCENE-10644:
--

Commit 51d756b7801da5bb3e49b9f887cbf5ec4c05b0c5 in lucene's branch 
refs/heads/branch_9x from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=51d756b7801 ]

LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013)






[jira] [Commented] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581476#comment-17581476
 ] 

Greg Miller commented on LUCENE-10644:
--

Merged and backported. Thanks!




[jira] [Commented] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581463#comment-17581463
 ] 

ASF subversion and git services commented on LUCENE-10644:
--

Commit 0914b537dbfb1ecd49bfb90c27df69a67e50c327 in lucene's branch 
refs/heads/main from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0914b537dbf ]

LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013)






[GitHub] [lucene] gsmiller merged pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-08-18 Thread GitBox


gsmiller merged PR #1013:
URL: https://github.com/apache/lucene/pull/1013





[jira] [Comment Edited] (LUCENE-10454) UnifiedHighlighter can miss terms because of query rewrites

2022-08-18 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581430#comment-17581430
 ] 

Julie Tibshirani edited comment on LUCENE-10454 at 8/18/22 4:31 PM:


This popped up again in LUCENE-10680 (which we closed as a duplicate of this 
one).


was (Author: julietibs):
This popped up again in LUCENE-10454 (which we closed as a duplicate of this 
one).

> UnifiedHighlighter can miss terms because of query rewrites
> ---
>
> Key: LUCENE-10454
> URL: https://issues.apache.org/jira/browse/LUCENE-10454
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Attachments: LUCENE-10454-fix.patch, LUCENE-10454.patch
>
>
> Before extracting terms from a query, UnifiedHighlighter rewrites the query 
> using an empty searcher. If the query rewrites to MatchNoDocsQuery when the 
> reader is empty, then the highlighter will fail to extract terms. This is 
> more of an issue now that we rewrite BooleanQuery to MatchNoDocsQuery when 
> any of its required clauses is MatchNoDocsQuery 
> (https://issues.apache.org/jira/browse/LUCENE-10412). I attached a patch 
> showing the problem.
> This feels like a pretty esoteric issue, but I figured it was worth raising 
> for awareness. I think it only applies when weightMatches=false, which isn't 
> the default. I couldn't find any existing queries in Lucene that would be 
> affected.
> We ran into it while upgrading Elasticsearch to the latest Lucene snapshot, 
> since a couple custom queries rewrite to MatchNoDocsQuery when the reader is 
> empty.






[jira] [Commented] (LUCENE-10454) UnifiedHighlighter can miss terms because of query rewrites

2022-08-18 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581430#comment-17581430
 ] 

Julie Tibshirani commented on LUCENE-10454:
---

This popped up again in LUCENE-10454 (which we closed as a duplicate of this 
one).







[jira] [Resolved] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-18 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-10680.
---
Resolution: Duplicate

> UnifiedHighlighter's term extraction not working for some query rewrites
> 
>
> Key: LUCENE-10680
> URL: https://issues.apache.org/jira/browse/LUCENE-10680
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Reporter: Yannick Welsch
>Priority: Minor
>
> UnifiedHighlighter rewrites the query against an empty index when extracting 
> the terms from the query (see 
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).
> The rewrite step can unfortunately drop the terms that are to be extracted.
> Take for example the boolean query "+field:value 
> -ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on 
> "field".
> The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`, 
> and as a `MUST_NOT` clause rewrites the overall boolean query to a 
> `MatchNoDocsQuery`, dropping the `MUST` clause in the process, which means 
> that the `field:value` term is not being extracted.
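
A toy model of the rewrite described above may make the failure easier to follow. The classes below are hypothetical stand-ins written only for this sketch (they are not Lucene's query API), and the two rewrite rules are simplified from the report: a field-exists query becomes match-all on an empty index, and a boolean query whose MUST_NOT clause is match-all becomes match-none.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of query rewriting. These types are illustrative and are NOT Lucene's API. */
final class RewriteSketch {

  interface Query {
    /** Returns a simplified query; emptyIndex models rewriting against an empty reader. */
    Query rewrite(boolean emptyIndex);

    /** Appends every "field:value" term a highlighter could use. */
    void extractTerms(List<String> out);
  }

  static final class TermQuery implements Query {
    final String field;
    final String value;

    TermQuery(String field, String value) {
      this.field = field;
      this.value = value;
    }

    public Query rewrite(boolean emptyIndex) {
      return this;
    }

    public void extractTerms(List<String> out) {
      out.add(field + ":" + value);
    }
  }

  static final class MatchAll implements Query {
    public Query rewrite(boolean emptyIndex) {
      return this;
    }

    public void extractTerms(List<String> out) {}
  }

  static final class MatchNone implements Query {
    public Query rewrite(boolean emptyIndex) {
      return this;
    }

    public void extractTerms(List<String> out) {}
  }

  static final class FieldExists implements Query {
    final String field;

    FieldExists(String field) {
      this.field = field;
    }

    public Query rewrite(boolean emptyIndex) {
      // Assumption taken from the report: on an empty index this becomes match-all.
      return emptyIndex ? new MatchAll() : this;
    }

    public void extractTerms(List<String> out) {}
  }

  /** A boolean query with exactly one MUST clause and one MUST_NOT clause. */
  static final class BooleanQuery implements Query {
    final Query must;
    final Query mustNot;

    BooleanQuery(Query must, Query mustNot) {
      this.must = must;
      this.mustNot = mustNot;
    }

    public Query rewrite(boolean emptyIndex) {
      Query m = must.rewrite(emptyIndex);
      Query n = mustNot.rewrite(emptyIndex);
      // Excluding everything, or requiring a clause that matches nothing, can never match.
      if (n instanceof MatchAll || m instanceof MatchNone) {
        return new MatchNone();
      }
      return new BooleanQuery(m, n);
    }

    public void extractTerms(List<String> out) {
      must.extractTerms(out); // prohibited clauses contribute no terms
    }
  }

  static List<String> terms(Query q) {
    List<String> out = new ArrayList<>();
    q.extractTerms(out);
    return out;
  }

  public static void main(String[] args) {
    Query q = new BooleanQuery(new TermQuery("field", "value"), new FieldExists("other_field"));
    System.out.println("original:  " + terms(q));
    System.out.println("rewritten: " + terms(q.rewrite(true)));
  }
}
```

In this model, extracting terms from the original query yields `field:value`, while extracting them after rewriting against an empty index yields nothing, because the MUST clause disappears together with the rest of the boolean query.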






[jira] [Commented] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-18 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581429#comment-17581429
 ] 

Julie Tibshirani commented on LUCENE-10680:
---

Confirmed we're okay to close in favor of LUCENE-10454.







[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-18 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r949257124


##
lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java:
##
@@ -117,6 +160,21 @@ public KnnVectorField(String name, float[] vector, FieldType fieldType) {
 fieldsData = vector;
   }
 
+  /**
+   * Creates a numeric vector field. Fields are single-valued: each document has either one value or
+   * no value. Vectors of a single field share the same dimension and similarity function.
+   *
+   * @param name field name
+   * @param vector value
+   * @param fieldType field type
+   * @throws IllegalArgumentException if any parameter is null, or the vector is empty or has
+   * dimension &gt; 1024.
+   */
+  public KnnVectorField(String name, BytesRef vector, FieldType fieldType) {

Review Comment:
   good idea






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-18 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r949256714


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +243,48 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  /**
+   * Dot product computed over signed bytes.
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the dot product of the two vectors
+   */
+  public static float dotProduct(BytesRef a, BytesRef b) {
+assert a.length == b.length;
+int total = 0;
+int aOffset = a.offset, bOffset = b.offset;
+for (int i = 0; i < a.length; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];
+}
+return total;
+  }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, BytesRef b) {
+// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
+  }
+
+  /**
+   * Convert a floating point vector to an array of bytes using casting; the vector values should be
+   * in [-128,127]
+   *
+   * @param vector a vector
+   * @return a new BytesRef containing the vector's values cast to byte.
+   */
+  public static BytesRef toBytesRef(float[] vector) {

Review Comment:
   hmm, that seems costly to me. I think these VectorUtil methods should be 
tuned for performance not safety, and we should find a way to move the safety 
checks up to a higher-level API?
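
The split suggested in this comment (fast unchecked utilities, with validation done once at a higher level) could look roughly like the sketch below. All names here are hypothetical and use plain `byte[]` rather than `BytesRef`. The score normalization (adding 0.5 after dividing by `len * 2^15`) is an assumption of this sketch, chosen so the result lands in [0, 1]; it is not necessarily what the PR does.

```java
/** Sketch of byte-quantized vector scoring; names are illustrative, not Lucene's VectorUtil. */
final class ByteVectorSketch {

  /** Dot product over signed bytes, accumulated in an int. No validation, by design. */
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  /**
   * Maps the dot product into [0, 1]. Each product a[i] * b[i] has absolute value at most 2^14,
   * so dot / (len * 2^15) lies in [-0.5, 0.5]. Assumes a.length > 0.
   */
  static float dotProductScore(byte[] a, byte[] b) {
    float denom = (float) (a.length * (1 << 15));
    return 0.5f + dotProduct(a, b) / denom;
  }

  /** Fast path: plain cast. Values outside [-128, 127] are not preserved. */
  static byte[] toBytesUnchecked(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      out[i] = (byte) vector[i];
    }
    return out;
  }

  /** Higher-level entry point: validates once, then delegates to the fast path. */
  static byte[] toBytesChecked(float[] vector) {
    for (float v : vector) {
      // Written as a negated range test so that NaN is rejected too.
      if (!(v >= -128f && v <= 127f)) {
        throw new IllegalArgumentException("value out of byte range: " + v);
      }
    }
    return toBytesUnchecked(vector);
  }

  public static void main(String[] args) {
    byte[] a = toBytesChecked(new float[] {1f, 2f, 3f});
    byte[] b = toBytesChecked(new float[] {4f, 5f, 6f});
    System.out.println(dotProduct(a, b));
    System.out.println(dotProductScore(a, b));
  }
}
```

With this layout the range check runs once, when a document or query vector enters the system, and the inner scoring loop stays free of branches.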



##
lucene/core/src/java/org/apache/lucene/index/VectorEncoding.java:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+/** The numeric datatype of the vector values. */
+public enum VectorEncoding {
+
+  /**
+   * Encodes vector using 8 bits of precision per sample. Use only with DOT_PRODUCT similarity.

Review Comment:
   no






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-18 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r949255426


##
lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java:
##
@@ -20,8 +20,12 @@
 import java.io.IOException;
 import org.apache.lucene.util.Accountable;
 
-/** Vectors' writer for a field */
-public abstract class KnnFieldVectorsWriter implements Accountable {
+/**
+ * Vectors' writer for a field
+ *
+ * @param <T> an array type; the type of vectors to be written
+ */
+public abstract class KnnFieldVectorsWriter<T> implements Accountable {

Review Comment:
   well I guess this is the key bone of contention. TBH I have thrown up my 
hands and just want whatever we can get merged that gets us the space savings. 
There were objections to the other way, so I switched back to this way.  This 
way is more API-pure I guess, but requires more methods and API surface area. I 
don't mind if we switch it again, but I'm probably not going to drive that 
effort.






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-18 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r949250783


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IO
   return NO_RESULTS;
 }
 
-BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-BitSetIterator filterIterator = new BitSetIterator(bitSet, bitSet.cardinality());
+BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
 
-if (filterIterator.cost() <= k) {
+if (acceptDocs.cardinality() <= k) {

Review Comment:
   OK I missed this point, sorry. I'm open to re-refactoring :)






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-18 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r949249778


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java:
##
@@ -76,6 +78,15 @@ public static KnnVectorsFormat forName(String name) {
   /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */
   public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException;
 
+  /**
+   * Returns the current KnnVectorsFormat version number. Indexes written using the format will be
+   * "stamped" with this version.
+   */
+  public int currentVersion() {

Review Comment:
   I agree this is messy -- are you suggesting the codec would provide a 
`randomVectorEncoding` method? Or is there some other way of detecting the 
Codec version?






[jira] [Resolved] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-08-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-10557.

Resolution: Resolved

The work has already been carried over to 
https://github.com/apache/lucene-jira-archive.
I'm closing this.


> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use GitHub issues instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I have 
> little knowledge of how to proceed, so I'd like to discuss whether we 
> should migrate, and if so, how to handle the migration smoothly.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
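
Two of the script considerations listed above, converting cross-issue links and keeping an old-to-new mapping, can be sketched as follows. This is only an illustration: the class name, the output format `#N (LUCENE-XYZ)`, and the issue number in the example mapping are made up and do not come from the actual migration tool.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of rewriting Jira keys in issue text; not the real migration script. */
final class JiraLinkSketch {

  private static final Pattern JIRA_KEY = Pattern.compile("\\bLUCENE-(\\d+)\\b");

  /**
   * Replaces each Jira key that has an entry in the mapping with "#N (LUCENE-XYZ)", so the
   * original key stays searchable. Keys without a mapping are left unchanged.
   */
  static String convert(String text, Map<String, Integer> jiraToGithub) {
    Matcher m = JIRA_KEY.matcher(text);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      Integer issueNumber = jiraToGithub.get(m.group());
      String replacement =
          issueNumber == null ? m.group() : "#" + issueNumber + " (" + m.group() + ")";
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, Integer> mapping = new HashMap<>();
    mapping.put("LUCENE-10454", 12345); // hypothetical GitHub issue number
    System.out.println(convert("Duplicate of LUCENE-10454, see also LUCENE-1.", mapping));
  }
}
```

The same map could be written out as the extra mapping file mentioned in the list, so that stale links from blogs and mail archives can still be resolved later.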






[GitHub] [lucene-jira-archive] mocobeta closed issue #93: Too much text is quoted

2022-08-18 Thread GitBox


mocobeta closed issue #93: Too much text is quoted
URL: https://github.com/apache/lucene-jira-archive/issues/93





[GitHub] [lucene-jira-archive] mocobeta commented on issue #93: Too much text is quoted

2022-08-18 Thread GitBox


mocobeta commented on issue #93:
URL: 
https://github.com/apache/lucene-jira-archive/issues/93#issuecomment-1219518870

   I think this is fixed in #146.





[GitHub] [lucene-jira-archive] mikemccand commented on issue #29: Can/should we make Jira read-only on migration to GitHub issues?

2022-08-18 Thread GitBox


mikemccand commented on issue #29:
URL: 
https://github.com/apache/lucene-jira-archive/issues/29#issuecomment-1219432675

   Wot!  Thank you @mocobeta!





[GitHub] [lucene-jira-archive] mocobeta commented on issue #29: Can/should we make Jira read-only on migration to GitHub issues?

2022-08-18 Thread GitBox


mocobeta commented on issue #29:
URL: 
https://github.com/apache/lucene-jira-archive/issues/29#issuecomment-1219356598

   I talked with Infra on the Slack channel.
   Jira will be read-only on Monday, August 22 at 8:00 UTC as I sent to the 
mail lists.





[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-08-18 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581246#comment-17581246
 ] 

Dawid Weiss commented on LUCENE-10662:
--

Yep, closed it just now, thanks.

> Make LuceneTestCase to not extend from org.junit.Assert
> ---
>
> Key: LUCENE-10662
> URL: https://issues.apache.org/jira/browse/LUCENE-10662
> Project: Lucene - Core
>  Issue Type: Test
>  Components: general/test
>Reporter: Marios Trivyzas
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since *LuceneTestCase* is a very useful abstract class that can be extended 
> and used by many projects, having it extend *org.junit.Assert* limits all 
> users to exclusively using the static methods of {*}org.junit.Assert{*}. In our 
> project we want to use [https://joel-costigliola.github.io/assertj] where the 
> main method to call is *org.assertj.core.api.Assertions.assertThat* which 
> conflicts with the deprecated {*}org.junit.Assert.assertThat{*}, recognized 
> by default by the compiler. So one can only use assertj by writing the 
> fully qualified name of the *assertThat* method on every call, i.e.
>  
> {code:java}
> org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
> {code}






[GitHub] [lucene] dweiss commented on pull request #1049: LUCENE-10662 Make LuceneTestCase to not extend from org.junit.Assert

2022-08-18 Thread GitBox


dweiss commented on PR #1049:
URL: https://github.com/apache/lucene/pull/1049#issuecomment-1219224934

   Postponed (indefinitely?), see LUCENE-10662.





[GitHub] [lucene] dweiss closed pull request #1049: LUCENE-10662 Make LuceneTestCase to not extend from org.junit.Assert

2022-08-18 Thread GitBox


dweiss closed pull request #1049: LUCENE-10662 Make LuceneTestCase to not 
extend from org.junit.Assert
URL: https://github.com/apache/lucene/pull/1049





[jira] [Resolved] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-08-18 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10662.
--
Resolution: Won't Do







[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-08-18 Thread Marios Trivyzas (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581239#comment-17581239
 ] 

Marios Trivyzas commented on LUCENE-10662:
--

Thanks [~dweiss]. Should this issue be closed then (and then my PR)?







[GitHub] [lucene-jira-archive] mocobeta commented on issue #148: Consider renaming legacy-jira-priority: ... to priority: ...

2022-08-18 Thread GitBox


mocobeta commented on issue #148:
URL: 
https://github.com/apache/lucene-jira-archive/issues/148#issuecomment-1219185850

   The label is an intentional one. I think we mark all the Jira metadata with 
"legacy-jira-" if we don't use them in GitHub.





[GitHub] [lucene-jira-archive] mocobeta commented on issue #7: Make a detailed migration plan

2022-08-18 Thread GitBox


mocobeta commented on issue #7:
URL: 
https://github.com/apache/lucene-jira-archive/issues/7#issuecomment-1219177302

   I talked about the date with infra.
   The migration will start on Monday, August 22nd at 8:00 UTC.





[GitHub] [lucene-jira-archive] vlsi opened a new issue, #148: Consider renaming legacy-jira-priority: ... to priority: ...

2022-08-18 Thread GitBox


vlsi opened a new issue, #148:
URL: https://github.com/apache/lucene-jira-archive/issues/148

   See 
https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1193923036
   
   Sample issue: https://github.com/apache/lucene/issues/1072
   
   I think the `legacy-jira-` prefix makes the label longer, and it does not add 
extra value.
   If you do not intend to use the label for newly created GitHub Issues, then 
is it really worth carrying the label?





[GitHub] [lucene-jira-archive] mocobeta closed issue #4: Which GitHub accont we should/can use for migration?

2022-08-18 Thread GitBox


mocobeta closed issue #4: Which GitHub accont we should/can use for migration?
URL: https://github.com/apache/lucene-jira-archive/issues/4





[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub accont we should/can use for migration?

2022-08-18 Thread GitBox


mocobeta commented on issue #4:
URL: https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1219150542

   I tested it, and it works.
   https://github.com/apache/lucene/issues/1070
   
   I'm closing this. Thank you to everyone who commented on this.





[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub accont we should/can use for migration?

2022-08-18 Thread GitBox


mocobeta commented on issue #4:
URL: https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1219146078

   Infra created this account for issue migration purposes: 
https://github.com/asfimport
   The profile icon will be set later.

