[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581942#comment-17581942
 ] 

Michael Sokolov commented on LUCENE-10318:
--

Another idea I played with at one point was to preserve all the graphs from the 
existing segments (remapping their ordinals) and linking them together with 
additional links. But a lot of links needed to be created to get close to the 
recall of a new "from scratch" graph, and I struggled to get any improvement. 
At the time I wasn't even concerning myself with deletions.
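A hypothetical sketch of the ordinal-remapping step described above; 
MergeState.DocMap is real Lucene API, while oldOrdToDoc and newDocToOrd are 
assumed stand-ins for the per-segment graph-node-to-doc mapping and the merged 
graph's doc-to-node mapping:

{code:java}
import org.apache.lucene.index.MergeState;

class GraphOrdinalRemapper {
  // Illustration only: map each node of a preserved per-segment graph to its
  // node id in the merged graph, via the merge-time doc id remapping.
  static int[] remap(MergeState.DocMap docMap, int[] oldOrdToDoc, int[] newDocToOrd) {
    int[] oldToNewOrd = new int[oldOrdToDoc.length];
    for (int oldOrd = 0; oldOrd < oldOrdToDoc.length; oldOrd++) {
      int newDoc = docMap.get(oldOrdToDoc[oldOrd]); // segment-local doc -> merged doc
      oldToNewOrd[oldOrd] = newDocToOrd[newDoc];    // merged doc -> merged-graph node
    }
    return oldToNewOrd;
  }
}
{code}

The expensive part is not this remapping but creating enough cross-graph links 
afterwards, which is where the recall gap described above showed up.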

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.






[jira] [Commented] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581941#comment-17581941
 ] 

Michael Sokolov commented on LUCENE-10681:
--

I think so; please close this one if you agree

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Major
>
> Hello,
> I looked for a similar issue but didn't find one, so I'm creating this; 
> sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 
> recently, and a user reported the error below while indexing a huge binary 
> file in a parent-children schema, where strings extracted from the huge 
> binary file (using the strings command) are indexed as thousands of ~10MB 
> child text docs under the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
>     at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]{noformat}
>  
> This seems like an integer overflow to me, but I'm not sure... It didn't 
> happen with the previous lucene-5.5.5, and indexing files like this is 
> pretty common for us, although with lucene-5.5.5 we used to break that huge 
> file up manually before indexing and to index each ~10MB chunk with a 
> separate IndexWriter.addDocument(Document) call; now we are using the 
> IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?
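If the integer-overflow theory is right (an assumption, not a confirmed 
diagnosis), the reported index is consistent with a byte offset that grew past 
the 32-bit range and was truncated. A minimal sketch of that arithmetic:

{code:java}
public class OverflowDemo {
  public static void main(String[] args) {
    // Assumed failure mode: a logical offset past the 32-bit range,
    // truncated to an int when used as an array index.
    long logicalOffset = 4294901760L; // 2^32 - 65536
    int index = (int) logicalOffset;  // truncates to 0xFFFF0000
    System.out.println(index);        // prints -65536, matching the report
  }
}
{code}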




[jira] [Updated] (LUCENE-10577) Enable quantization of HNSW vectors to 8 bits

2022-08-10 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov updated LUCENE-10577:
-
Summary: Enable quantization of HNSW vectors to 8 bits  (was: Quantize 
vector values)

> Enable quantization of HNSW vectors to 8 bits
> -
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
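A self-contained sketch of the linear scaling + quantization described in 
approach (2), using the max(abs(min-value), abs(max-value)) heuristic from the 
description; this is illustrative only, not the patch attached to this issue:

{code:java}
public class VectorQuantizer {
  // One shared scale for the whole data set, per the heuristic above:
  // the largest-magnitude component across all vectors maps to +/-127.
  public static float computeScale(float[][] vectors) {
    float maxAbs = 0f;
    for (float[] vector : vectors) {
      for (float v : vector) {
        maxAbs = Math.max(maxAbs, Math.abs(v));
      }
    }
    return maxAbs == 0f ? 1f : 127f / maxAbs;
  }

  // Linear scaling + rounding of 4-byte floats to signed bytes (4:1 compression).
  public static byte[] quantize(float[] vector, float scale) {
    byte[] quantized = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      quantized[i] = (byte) Math.round(vector[i] * scale);
    }
    return quantized;
  }
}
{code}

Dot-products computed on the quantized bytes can be mapped back to approximate 
float scores by dividing by scale squared, which is where the per-segment scale 
bookkeeping mentioned above would come in.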






[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-08-10 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577970#comment-17577970
 ] 

Michael Sokolov commented on LUCENE-10471:
--

> Maybe I do not understand the code base of Lucene well enough, but wouldn't 
> it be possible to have a default limit of 1024 or 2048, and that one could 
> set a different limit programmatically on the IndexWriter/Reader/Searcher?

I think the idea is to protect ourselves from accidental booboos; this could 
eventually get exposed in some shared configuration file, and then if somebody 
passes MAX_INT it could lead to allocating huge buffers somewhere and taking 
down a service shared by many people/groups? Hypothetical, but it's basically 
following the principle that we should be strict to help stop people shooting 
themselves and others in the feet. We may also want to preserve our ability to 
introduce optimizations that rely on some limits to the size, which would 
become difficult if usage of larger sizes became entrenched. (We can't so 
easily take it back once it's out there). Having said that, I still feel a 16K 
limit, while allowing for models that are beyond reasonable, wouldn't cause any 
of these sorts of issues, so that's the number I'm advocating.

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.






[jira] [Resolved] (LUCENE-10386) Add BOM module for ease of dependency management in dependent projects

2022-08-07 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov resolved LUCENE-10386.
--
Resolution: Won't Fix

> Add BOM module for ease of dependency management in dependent projects
> --
>
> Key: LUCENE-10386
> URL: https://issues.apache.org/jira/browse/LUCENE-10386
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: general/build
>Affects Versions: 9.0, 8.4, 8.11.1
>Reporter: Petr Portnov
>Priority: Trivial
>  Labels: BOM, Dependencies
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h1. Short description
> Add a Bill-of-Materials (BOM) module to Lucene to allow foreign projects to 
> use it for dependency management.
> h1. Reasoning
> [A lot of|https://mvnrepository.com/search?q=bom] multi-module projects 
> provide BOMs in order to simplify dependency management. This allows 
> dependent projects to specify only the version of the BOM module while 
> declaring the dependencies without versions (as they will be provided by the 
> BOM).
> For example:
> {code:groovy}
> dependencies {
> // Only specify the version of the BOM
> implementation platform('com.fasterxml.jackson:jackson-bom:2.13.1')
> // Don't specify dependency versions as they are provided by the BOM
> implementation "com.fasterxml.jackson.core:jackson-annotations"
> implementation "com.fasterxml.jackson.core:jackson-core"
> implementation "com.fasterxml.jackson.core:jackson-databind"
> implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-guava"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jdk8"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jsr310"
> implementation "com.fasterxml.jackson.module:jackson-module-parameter-names"
> }{code}
>  
> Not only is this approach "popular" but it also has the following pros:
>  * there is no need to declare a variable (via Maven properties or Gradle 
> ext) to hold the version
>  * this is more automation-friendly because tools like Dependabot only have 
> to update a single version per dependency group
> h1. Other suggestions
> It may be reasonable to also publish BOMs for old versions so that the 
> projects which currently rely on older Lucene versions (such as 8.4) can 
> migrate to the BOM approach without migrating to Lucene 9.0.






[jira] [Commented] (LUCENE-10386) Add BOM module for ease of dependency management in dependent projects

2022-08-06 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576301#comment-17576301
 ] 

Michael Sokolov commented on LUCENE-10386:
--

It seems we're not going to do this – should we close the issue?

> Add BOM module for ease of dependency management in dependent projects
> --
>
> Key: LUCENE-10386
> URL: https://issues.apache.org/jira/browse/LUCENE-10386
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: general/build
>Affects Versions: 9.0, 8.4, 8.11.1
>Reporter: Petr Portnov
>Priority: Trivial
>  Labels: BOM, Dependencies
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h1. Short description
> Add a Bill-of-Materials (BOM) module to Lucene to allow foreign projects to 
> use it for dependency management.
> h1. Reasoning
> [A lot of|https://mvnrepository.com/search?q=bom] multi-module projects 
> provide BOMs in order to simplify dependency management. This allows 
> dependent projects to specify only the version of the BOM module while 
> declaring the dependencies without versions (as they will be provided by the 
> BOM).
> For example:
> {code:groovy}
> dependencies {
> // Only specify the version of the BOM
> implementation platform('com.fasterxml.jackson:jackson-bom:2.13.1')
> // Don't specify dependency versions as they are provided by the BOM
> implementation "com.fasterxml.jackson.core:jackson-annotations"
> implementation "com.fasterxml.jackson.core:jackson-core"
> implementation "com.fasterxml.jackson.core:jackson-databind"
> implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-guava"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jdk8"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jsr310"
> implementation "com.fasterxml.jackson.module:jackson-module-parameter-names"
> }{code}
>  
> Not only is this approach "popular" but it also has the following pros:
>  * there is no need to declare a variable (via Maven properties or Gradle 
> ext) to hold the version
>  * this is more automation-friendly because tools like Dependabot only have 
> to update a single version per dependency group
> h1. Other suggestions
> It may be reasonable to also publish BOMs for old versions so that the 
> projects which currently rely on older Lucene versions (such as 8.4) can 
> migrate to the BOM approach without migrating to Lucene 9.0.






[jira] [Commented] (LUCENE-10671) Lucene

2022-08-01 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573918#comment-17573918
 ] 

Michael Sokolov commented on LUCENE-10671:
--

The "bad" links are still visible in the History tab - I wonder if we can erase 
them from there?

> Lucene
> --
>
> Key: LUCENE-10671
> URL: https://issues.apache.org/jira/browse/LUCENE-10671
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/hnsw
>Affects Versions: 8.11.2
>Reporter: allnewcracksoftwares
>Priority: Minor
>
> [link title|http://example.com]






[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-07-29 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573029#comment-17573029
 ] 

Michael Sokolov commented on LUCENE-10662:
--

> They can do of course, if LuceneTestCase doesn't extend org.junit.Assert, one 
> can easily define an intermediate abstract class between LuceneTestCase and 
> their actual test classes and 

Why is it necessary to break the inheritance in order to achieve what is wanted 
here? Can't we leave LuceneTestCase alone and define a MyLuceneTestCase extends 
LuceneTestCase that implements its own static `assertWhatever` methods, defined 
to explicitly call the `assertj` implementation? Then implementers of 
`MyLuceneTestCase` get the definitions they want, and no changes are required 
in Lucene. Maybe if you are not currently inheriting from LuceneTestCase you 
must change to inherit from MyLuceneTestCase, but that is a small change 
compared to what is contemplated here.
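A sketch of the intermediate class described above; MyLuceneTestCase is a 
hypothetical name, and the LuceneTestCase import assumes the 9.x test-framework 
package. Because the assertj-style assertThat takes a single argument, it is an 
overload of, not a clash with, the deprecated two-argument 
org.junit.Assert.assertThat:

{code:java}
import org.apache.lucene.tests.util.LuceneTestCase;
import org.assertj.core.api.Assertions;
import org.assertj.core.api.ObjectAssert;

public abstract class MyLuceneTestCase extends LuceneTestCase {
  // Static one-argument assertThat delegating to assertj, so subclasses can
  // write assertThat(actual).isEqualTo(expected) without qualification.
  public static <T> ObjectAssert<T> assertThat(T actual) {
    return Assertions.assertThat(actual);
  }
}
{code}

Tests extending this class could then write 
assertThat(myObj.name()).isEqualTo(expectedName) directly.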

> Make LuceneTestCase to not extend from org.junit.Assert
> ---
>
> Key: LUCENE-10662
> URL: https://issues.apache.org/jira/browse/LUCENE-10662
> Project: Lucene - Core
>  Issue Type: Test
>  Components: general/test
>Reporter: Marios Trivyzas
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since *LuceneTestCase* is a very useful abstract class that can be extended 
> and used by many projects, having it extending *org.junit.Assert* limits all 
> users to exclusively use the static methods of {*}org.junit.Assert{*}. In our 
> project we want to use [https://joel-costigliola.github.io/assertj] where the 
> main method to call is *org.assertj.core.api.Assertions.assertThat* which 
> conflicts with the deprecated {*}org.junit.Assert.assertThat{*}, recognized 
> by default by the compiler. So one can only use assertj if on every call one 
> uses the fully qualified name of the *assertThat* method, i.e.
>  
> {code:java}
> org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
> {code}






[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-07-28 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572672#comment-17572672
 ] 

Michael Sokolov commented on LUCENE-10662:
--

Is it possible to redefine a static method in a subclass? Could people who 
prefer assertj do that, in some class derived from LuceneTestCase?

> Make LuceneTestCase to not extend from org.junit.Assert
> ---
>
> Key: LUCENE-10662
> URL: https://issues.apache.org/jira/browse/LUCENE-10662
> Project: Lucene - Core
>  Issue Type: Test
>  Components: general/test
>Reporter: Marios Trivyzas
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since *LuceneTestCase* is a very useful abstract class that can be extended 
> and used by many projects, having it extending *org.junit.Assert* limits all 
> users to exclusively use the static methods of {*}org.junit.Assert{*}. In our 
> project we want to use [https://joel-costigliola.github.io/assertj] where the 
> main method to call is *org.assertj.core.api.Assertions.assertThat* which 
> conflicts with the deprecated {*}org.junit.Assert.assertThat{*}, recognized 
> by default by the compiler. So one can only use assertj if on every call one 
> uses the fully qualified name of the *assertThat* method, i.e.
>  
> {code:java}
> org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
> {code}






[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-28 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572514#comment-17572514
 ] 

Michael Sokolov commented on LUCENE-10404:
--

Yes, I think your understanding is correct - another difference is the 
dimension of the vectors, which was 256 in the first case and 100 in the 
second. I think this second one will tend to emphasize the cost of the graph 
traversal, since dot-product costs will be a smaller proportion of the overall 
time spent. Indeed I think some specialization would help there.

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
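For illustration, a minimal sketch of the two visited-tracking strategies being 
compared; IntIntHashMap is the hppc fork mentioned above 
(org.apache.lucene.util.hppc in recent 9.x), used as a set by mapping visited 
nodes to 1:

{code:java}
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.hppc.IntIntHashMap;

public class VisitedTrackingDemo {
  public static void main(String[] args) {
    int graphSize = 1_000_000;
    int node = 42;

    // Current approach: one bit per graph node, allocated (or cleared) per search.
    FixedBitSet visitedBits = new FixedBitSet(graphSize);
    if (visitedBits.getAndSet(node) == false) {
      // first visit to this node
    }

    // Proposed approach: memory proportional to the nodes actually visited.
    IntIntHashMap visitedMap = new IntIntHashMap();
    if (visitedMap.put(node, 1) == 0) {
      // put returns the previous value (0 if absent), so this is a first visit
    }
  }
}
{code}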






[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571631#comment-17571631
 ] 

Michael Sokolov commented on LUCENE-10577:
--

OK, I will revive the FieldInfo version of this thing and see about making a 
byte-oriented KnnVectorField; perhaps the VectorFormat can remain internal in 
that case. It seems likely to me that if this is a win for this algorithm, it 
could very well be so for others. Plus there is an easy fallback position, 
which is to accept bytes and inflate them to four-byte floats, so the burden is 
not necessarily so great on future vector formats. Agree we can add Euclidean 
distance.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.






[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 8:01 PM:
---

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost
h3. baseline
{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}


was (Author: sokolov):
Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 

[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 7:59 PM:
---

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}



was (Author: sokolov):
Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted 

[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov commented on LUCENE-10404:
--

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.






[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571374#comment-17571374
 ] 

Michael Sokolov commented on LUCENE-10151:
--

oh, too bad. Well this feature is new, so at least no existing usage will be 
broken.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.
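A hedged sketch of the API as it landed in 9.3: IndexSearcher.setTimeout(QueryTimeout) 
plus IndexSearcher.timedOut(), with QueryTimeoutImpl being the pre-existing 
helper from org.apache.lucene.index; the searcher and query are assumed to be 
built elsewhere:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.QueryTimeoutImpl;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class TimeoutSearchDemo {
  static TopDocs searchWithBudget(IndexSearcher searcher, Query query) throws IOException {
    searcher.setTimeout(new QueryTimeoutImpl(50)); // time budget in milliseconds
    TopDocs hits = searcher.search(query, 10);
    if (searcher.timedOut()) {
      // Partial results; TotalHits.Relation should be GREATER_THAN_OR_EQUAL_TO.
    }
    return hits;
  }
}
{code}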






[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-22 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570195#comment-17570195
 ] 

Michael Sokolov commented on LUCENE-10404:
--

The default `topK` in KnnGraphTester is 100, so these test runs are maintaining 
results queues of 100 or 150 (when searching). During indexing this is driven 
by beamWidth, and 32/64 is lower than is typical, I think. Still I think it's 
encouraging that we see gains in both searching (when the queue size is 
100-150) and indexing, when it is 32-64.

I won't be able to run more tests for a few days, but I agree that it would be 
interesting to see how the gains correlate with the queue sizes. But I was 
motivated to get a quick look! Will run some more exhaustive tests next week.

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.






[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-21 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569457#comment-17569457
 ] 

Michael Sokolov commented on LUCENE-10655:
--

Ah, I see - I hadn't followed your investigations there closely, [~julietibs]. 
Well, at least we can confirm what you had found. I'll close this now - it 
doesn't seem fruitful, and I think the hash set idea has legs.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.
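A tiny sketch of idea #1 from the quoted list: both bitset classes exist in 
org.apache.lucene.util, but the level-based selection is the proposal here, not 
existing code:

{code:java}
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.SparseFixedBitSet;

public class VisitedBitSets {
  // Upper HNSW layers hold a small fraction of the nodes, so a sparse set
  // should waste less time in clear(); level 0 contains every node, so a
  // dense FixedBitSet stays the better fit there.
  static BitSet visitedSetForLevel(int level, int graphSize) {
    return level > 0 ? new SparseFixedBitSet(graphSize) : new FixedBitSet(graphSize);
  }
}
{code}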






[jira] [Resolved] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-21 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov resolved LUCENE-10655.
--
Resolution: Fixed

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.






[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-21 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569429#comment-17569429
 ] 

Michael Sokolov commented on LUCENE-10404:
--

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed 
seem to be a small speedup. I haven't had a chance to run luceneutil or look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.37   1   0       16      32      100     1566}}
{{0.965    0.49   1   50      16      32      150     0}}
{{0.962    0.41   1   0       16      64      100     2655}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.38   1   0       32      32      100     1473}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.45   1   0       32      64      100     2611}}
{{0.985    0.59   1   50      32      64      150     0}}
{{0.907    0.52   10  0       16      32      100     19850}}
{{0.940    0.72   10  50      16      32      150     0}}
{{0.941    0.60   10  0       16      64      100     38614}}
{{0.966    0.84   10  50      16      64      150     0}}
{{0.916    0.55   10  0       32      32      100     19243}}
{{0.949    0.75   10  50      32      32      150     0}}
{{0.952    0.66   10  0       32      64      100     38205}}
{{0.973    0.93   10  50      32      64      150     0}}
{{0.859    0.66   100 0       16      32      100     273112}}
{{0.897    0.92   100 50      16      32      150     0}}
{{0.917    0.85   100 0       16      64      100     523325}}
{{0.946    1.06   100 50      16      64      150     0}}
{{0.874    0.80   100 0       32      32      100     274816}}
{{0.913    1.05   100 50      32      32      150     0}}
{{0.929    0.98   100 0       32      64      100     564762}}
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.38   1   0       16      32      100     1614}}
{{0.965    0.50   1   50      16      32      150     0}}
{{0.962    0.45   1   0       16      64      100     2687}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.40   1   0       32      32      100     1504}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.44   1   0       32      64      100     2652}}
{{0.985    0.58   1   50      32      64      150     0}}
{{0.907    0.54   10  0       16      32      100     21449}}
{{0.940    0.74   10  50      16      32      150     0}}
{{0.941    0.64   10  0       16      64      100     39962}}
{{0.966    0.88   10  50      16      64      150     0}}
{{0.916    0.59   10  0       32      32      100     20554}}
{{0.949    0.80   10  50      32      32      150     0}}
{{0.952    0.72   10  0       32      64      100     40980}}
{{0.973    1.04   10  50      32      64      150     0}}
{{0.859    0.75   100 0       16      32      100     300514}}
{{0.897    0.96   100 50      16      32      150     0}}
{{0.917    0.84   100 0       16      64      100     563259}}
{{0.946    1.12   100 50      16      64      150     0}}
{{0.874    0.86   100 0       32      32      100     303186}}
{{0.913    1.09   100 50      32      32      150     0}}
{{0.929    1.04   100 0       32      64      100     580725}}
{{0.958    1.38   100 50      32      64      150     0}}

 

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the 

[jira] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-21 Thread Michael Sokolov (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10404 ]


Michael Sokolov deleted comment on LUCENE-10404:
--

was (Author: sokolov):
I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed 
seem to be a small speedup. I haven't had a chance to run luceneutil or look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.935    0.37   1   0       16      32      100     1566
0.965    0.49   1   50      16      32      150     0
0.962    0.41   1   0       16      64      100     2655
0.982    0.57   1   50      16      64      150     0
0.941    0.38   1   0       32      32      100     1473
0.969    0.51   1   50      32      32      150     0
0.966    0.45   1   0       32      64      100     2611
0.985    0.59   1   50      32      64      150     0
0.907    0.52   10  0       16      32      100     19850
0.940    0.72   10  50      16      32      150     0
0.941    0.60   10  0       16      64      100     38614
0.966    0.84   10  50      16      64      150     0
0.916    0.55   10  0       32      32      100     19243
0.949    0.75   10  50      32      32      150     0
0.952    0.66   10  0       32      64      100     38205
0.973    0.93   10  50      32      64      150     0
0.859    0.66   100 0       16      32      100     273112
0.897    0.92   100 50      16      32      150     0

0.917    0.85   100 0       16      64      100     523325
0.946    1.06   100 50      16      64      150     0



more to come – pushed ctrl-enter instead of enter ...
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.38   1   0       16      32      100     1614}}
{{0.965    0.50   1   50      16      32      150     0}}
{{0.962    0.45   1   0       16      64      100     2687}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.40   1   0       32      32      100     1504}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.44   1   0       32      64      100     2652}}
{{0.985    0.58   1   50      32      64      150     0}}
{{0.907    0.54   10  0       16      32      100     21449}}
{{0.940    0.74   10  50      16      32      150     0}}
{{0.941    0.64   10  0       16      64      100     39962}}
{{0.966    0.88   10  50      16      64      150     0}}
{{0.916    0.59   10  0       32      32      100     20554}}
{{0.949    0.80   10  50      32      32      150     0}}
{{0.952    0.72   10  0       32      64      100     40980}}
{{0.973    1.04   10  50      32      64      150     0}}
{{0.859    0.75   100 0       16      32      100     300514}}
{{0.897    0.96   100 50      16      32      150     0}}
{{0.917    0.84   100 0       16      64      100     563259}}
{{0.946    1.12   100 50      16      64      150     0}}
{{0.874    0.86   100 0       32      32      100     303186}}
{{0.913    1.09   100 50      32      32      150     0}}
{{0.929    1.04   100 0       32      64      100     580725}}
{{0.958    1.38   100 50      32      64      150     0}}

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
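A minimal sketch of what such an int hash set could look like (illustrative 
only - hppc's IntHashSet is the real candidate named above, and none of these 
names are Lucene API):

{code:java}
// Illustrative open-addressing int set, similar in spirit to hppc's IntHashSet.
// Only what the HNSW search loop needs: add() reporting whether the node is new.
class VisitedSet {
  private final int[] keys; // 0 means empty, so nodes are stored as node + 1
  private final int mask;

  VisitedSet(int expectedNodes) {
    // next power of two >= 2x expected size; no growth logic, for brevity
    int cap = Integer.highestOneBit(Math.max(16, expectedNodes * 2) - 1) << 1;
    keys = new int[cap];
    mask = cap - 1;
  }

  boolean add(int node) {
    int slot = (node * 0x9E3779B9) & mask; // multiplicative hash
    while (true) {
      int k = keys[slot];
      if (k == 0) {
        keys[slot] = node + 1;
        return true; // newly visited
      }
      if (k == node + 1) {
        return false; // already visited
      }
      slot = (slot + 1) & mask; // linear probing
    }
  }
}
{code}

Unlike a BitSet sized to the whole graph, the memory touched here scales with 
the number of nodes actually visited, which is why it could win at index time.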



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-21 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569420#comment-17569420
 ] 

Michael Sokolov edited comment on LUCENE-10404 at 7/21/22 1:39 PM:
---

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed 
seem to give a small speedup. I haven't had a chance to run luceneutil or look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.935    0.37   1   0       16      32      100     1566
0.965    0.49   1   50      16      32      150     0
0.962    0.41   1   0       16      64      100     2655
0.982    0.57   1   50      16      64      150     0
0.941    0.38   1   0       32      32      100     1473
0.969    0.51   1   50      32      32      150     0
0.966    0.45   1   0       32      64      100     2611
0.985    0.59   1   50      32      64      150     0
0.907    0.52   10  0       16      32      100     19850
0.940    0.72   10  50      16      32      150     0
0.941    0.60   10  0       16      64      100     38614
0.966    0.84   10  50      16      64      150     0
0.916    0.55   10  0       32      32      100     19243
0.949    0.75   10  50      32      32      150     0
0.952    0.66   10  0       32      64      100     38205
0.973    0.93   10  50      32      64      150     0
0.859    0.66   100 0       16      32      100     273112
0.897    0.92   100 50      16      32      150     0

0.917    0.85   100 0       16      64      100     523325
0.946    1.06   100 50      16      64      150     0

more to come – pushed ctrl-enter instead of enter ...
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.38   1   0       16      32      100     1614}}
{{0.965    0.50   1   50      16      32      150     0}}
{{0.962    0.45   1   0       16      64      100     2687}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.40   1   0       32      32      100     1504}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.44   1   0       32      64      100     2652}}
{{0.985    0.58   1   50      32      64      150     0}}
{{0.907    0.54   10  0       16      32      100     21449}}
{{0.940    0.74   10  50      16      32      150     0}}
{{0.941    0.64   10  0       16      64      100     39962}}
{{0.966    0.88   10  50      16      64      150     0}}
{{0.916    0.59   10  0       32      32      100     20554}}
{{0.949    0.80   10  50      32      32      150     0}}
{{0.952    0.72   10  0       32      64      100     40980}}
{{0.973    1.04   10  50      32      64      150     0}}
{{0.859    0.75   100 0       16      32      100     300514}}
{{0.897    0.96   100 50      16      32      150     0}}
{{0.917    0.84   100 0       16      64      100     563259}}
{{0.946    1.12   100 50      16      64      150     0}}
{{0.874    0.86   100 0       32      32      100     303186}}
{{0.913    1.09   100 50      32      32      150     0}}
{{0.929    1.04   100 0       32      64      100     580725}}
{{0.958    1.38   100 50      32      64      150     0}}


was (Author: sokolov):
I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed 
seem to give a small speedup. I haven't had a chance to run luceneutil or look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.37   1   0       16      32      100     1566}}
{{0.965    0.49   1   50      16      32      150     0}}
{{0.962    0.41   1   0       16      64      100     2655}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.38   1   0       32      32      100     1473}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.45   1   0       32      64      100     2611}}
{{0.985    0.59   1   50      32      64      150     0}}
{{0.907    0.52   10  0       16      32      100     19850}}
{{0.940    0.72   10  50      16      32      150     0}}
{{0.941    0.60   10  0       16      64      100     38614}}
{{0.966    0.84   10  50      16      64      150     0}}
{{0.916   

[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-21 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569420#comment-17569420
 ] 

Michael Sokolov commented on LUCENE-10404:
--

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed 
seem to give a small speedup. I haven't had a chance to run luceneutil or look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.37   1   0       16      32      100     1566}}
{{0.965    0.49   1   50      16      32      150     0}}
{{0.962    0.41   1   0       16      64      100     2655}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.38   1   0       32      32      100     1473}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.45   1   0       32      64      100     2611}}
{{0.985    0.59   1   50      32      64      150     0}}
{{0.907    0.52   10  0       16      32      100     19850}}
{{0.940    0.72   10  50      16      32      150     0}}
{{0.941    0.60   10  0       16      64      100     38614}}
{{0.966    0.84   10  50      16      64      150     0}}
{{0.916    0.55   10  0       32      32      100     19243}}
{{0.949    0.75   10  50      32      32      150     0}}
{{0.952    0.66   10  0       32      64      100     38205}}
{{0.973    0.93   10  50      32      64      150     0}}
{{0.859    0.66   100 0       16      32      100     273112}}
{{0.897    0.92   100 50      16      32      150     0}}
{{0.917    0.85   100 0       16      64      100     523325}}
{{0.946    1.06   100 50      16      64      150     0}}
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.38   1   0       16      32      100     1614}}
{{0.965    0.50   1   50      16      32      150     0}}
{{0.962    0.45   1   0       16      64      100     2687}}
{{0.982    0.57   1   50      16      64      150     0}}
{{0.941    0.40   1   0       32      32      100     1504}}
{{0.969    0.51   1   50      32      32      150     0}}
{{0.966    0.44   1   0       32      64      100     2652}}
{{0.985    0.58   1   50      32      64      150     0}}
{{0.907    0.54   10  0       16      32      100     21449}}
{{0.940    0.74   10  50      16      32      150     0}}
{{0.941    0.64   10  0       16      64      100     39962}}
{{0.966    0.88   10  50      16      64      150     0}}
{{0.916    0.59   10  0       32      32      100     20554}}
{{0.949    0.80   10  50      32      32      150     0}}
{{0.952    0.72   10  0       32      64      100     40980}}
{{0.973    1.04   10  50      32      64      150     0}}
{{0.859    0.75   100 0       16      32      100     300514}}
{{0.897    0.96   100 50      16      32      150     0}}
{{0.917    0.84   100 0       16      64      100     563259}}
{{0.946    1.12   100 50      16      64      150     0}}
{{0.874    0.86   100 0       32      32      100     303186}}
{{0.913    1.09   100 50      32      32      150     0}}
{{0.929    1.04   100 0       32      64      100     580725}}
{{0.958    1.38   100 50      32      64      150     0}}

> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567366#comment-17567366
 ] 

Michael Sokolov commented on LUCENE-10655:
--

meh, I tried a few things, but nothing really moved the needle.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567358#comment-17567358
 ] 

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:41 PM:
---

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently. Oh I see - in fact that *is* what we do. Okay, returning to this 
again, I think I will try using that one for the fully-populated level only


was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567358#comment-17567358
 ] 

Michael Sokolov edited comment on LUCENE-10655 at 7/15/22 6:39 PM:
---

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer, and we re-use the same one for the lifetime of a HnswGraphBuilder, which 
processes all the vectors. And I tried allocating afresh rather than clear-ing, 
and it was a bit slower. FixedBitSet.clear() is a hot-spot but it's not really 
clear what's to be done about it.

Perhaps since we are re-using we could try using a fully-allocated FixedBitSet 
(not sparse) when indexing? My concern is that over the lifetime of indexing 
many vectors, the sparse bit set will eventually become dense, but 
inefficiently.


was (Author: sokolov):
OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer. And I tried allocating afresh rather than clear-ing, and it was a bit 
slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to 
be done about it.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567358#comment-17567358
 ] 

Michael Sokolov commented on LUCENE-10655:
--

OK I was confused, and in fact we already do use SparseFixedBitSet for every 
layer. And I tried allocating afresh rather than clear-ing, and it was a bit 
slower. FixedBitSet.clear() is a hot-spot but it's not really clear what's to 
be done about it.

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567334#comment-17567334
 ] 

Michael Sokolov commented on LUCENE-10633:
--

Adrien, that's crazy!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-15 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567330#comment-17567330
 ] 

Michael Sokolov commented on LUCENE-10151:
--

> Did you forget to push to branch_9x? I cannot see the change there.

Yes! Thanks - pushed now

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.
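
For readers landing here later, a sketch of how the committed feature can be 
used (assuming the {{QueryTimeoutImpl}} helper and the 
{{IndexSearcher#setTimeout}}/{{#timedOut}} methods this change added; package 
locations may differ across 9.x versions):

{code:java}
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.QueryTimeoutImpl;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TimeoutSearchExample {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setTimeout(new QueryTimeoutImpl(50)); // 50 ms budget
      TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
      if (searcher.timedOut()) {
        // partial results; TotalHits relation is GREATER_THAN_OR_EQUAL_TO
        System.out.println("timed out, partial results: " + hits.totalHits);
      }
    }
  }
}
{code}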



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-15 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10655:


 Summary: can we optimize visited bitset usage in HNSW graph 
search/indexing?
 Key: LUCENE-10655
 URL: https://issues.apache.org/jira/browse/LUCENE-10655
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/hnsw
Reporter: Michael Sokolov


When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
the CPU profiler output. I had a few ideas:
 # In upper graph layers, the occupied nodes are very sparse - maybe 
{{SparseFixedBitSet}} would be a better fit for those
 # We are caching these bitsets, but they are only used for a single search 
(single document insert, during indexing). Should we cache across searches? We 
would need to pool them though, and they would vary by field since fields can 
have different numbers of vector nodes. This starts to get complex
 # Are we sure that clearing a bitset is more efficient than allocating a new 
one? Maybe the JDK maintains a pool of already-zeroed memory for us

I think we could try specializing the bitset type by graph level, and then I 
think we ought to measure the performance of allocation vs the limited reuse 
that we currently have.
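
As a sketch of idea #1 above (illustrative names, not the builder's actual 
code), the choice could look like:

{code:java}
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.SparseFixedBitSet;

// Illustrative only: pick a visited-set representation per graph level.
// Level 0 contains every node, so a dense FixedBitSet amortizes well there;
// upper levels are sparsely populated, favoring SparseFixedBitSet.
class VisitedBitSets {
  static BitSet forLevel(int level, int numNodes) {
    return level == 0 ? new FixedBitSet(numNodes) : new SparseFixedBitSet(numNodes);
  }
}
{code}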



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566498#comment-17566498
 ] 

Michael Sokolov commented on LUCENE-10577:
--

I'm also not sure about Codec parameter vs FieldInfo, but it's clearly a 
lower-profile change to add to the Codec, and we could always extend it to the 
Field later?

I think you meant "we allow {*}signed{*} byte values"? Thanks for raising this 
- I had completely forgotten about the scoring angle. To keep the byte 
dot-product scores positive we can divide by dimension * 2^14 (the max product 
of two bytes). Scores might end up quite small, but at least it will be safe 
and shouldn't lose any information.
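
To make that arithmetic concrete, a hedged sketch; the final shift into a 
non-negative range is one option, not necessarily what was committed:

{code:java}
// Sketch of scoring quantized (signed byte) vectors with dot product.
// |a[i] * b[i]| <= 128 * 128 = 2^14, so dividing by dimension * 2^14
// bounds the normalized dot product by 1 in absolute value.
static float byteDotProductScore(byte[] a, byte[] b) {
  int dot = 0;
  for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; // bytes promote to int; no overflow at realistic dims
  }
  float normalized = dot / (float) (a.length * (1 << 14));
  return (1 + normalized) / 2; // map [-1, 1] into [0, 1] to keep scores positive
}
{code}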

Regarding Euclidean - consider that Euclidean is only different from 
dot-product when the vectors have different lengths (Euclidean norms). If they 
are all the same, you might as well use dot product since it will lead to the 
same ordering (although the values will differ). On the other hand, if they 
are different, then quantization into a byte necessarily loses more 
information: if you scale by a large value to make everything fit into a byte, 
the precision of small values scaled by the same constant is greatly reduced. 
I felt this made it a bad fit, and prevented it. But we could certainly 
implement Euclidean distance over bytes. Maybe somebody smarter will find a 
use for it.

Also, I didn't do anything special about the similarity computation in 
KnnVectorQuery, where it is used when falling back to exact KNN. There was no 
test failure because the codec converts to float on demand, and that is what 
was happening there. So it would be suboptimal in this case. But worse, these 
floats will be large and negative and could potentially lead to negative 
scores. To address this we may want to refactor/move the exact KNN computation 
into the vector Reader; i.e. {{KnnVectorsReader.exactSearch(String field, 
float[] target, int k)}}.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's 

[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566015#comment-17566015
 ] 

Michael Sokolov commented on LUCENE-10577:
--

OK, that makes sense to me – I'll see about moving the setting to the 
`Lucene93HnswVectorsFormat`

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
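
For illustration, the linear scaling described above might look like this (a 
sketch; the scale constant is the per-dataset tuning knob the description 
mentions):

{code:java}
// Sketch of scale-and-quantize: map floats in [-scale, scale] to signed bytes.
// scale might be chosen as max(abs(minValue), abs(maxValue)) over the dataset.
static byte[] quantize(float[] v, float scale) {
  byte[] out = new byte[v.length];
  for (int i = 0; i < v.length; i++) {
    int q = Math.round(v[i] / scale * 127f); // [-scale, scale] -> [-127, 127]
    out[i] = (byte) Math.max(-127, Math.min(127, q)); // clamp outliers
  }
  return out;
}
{code}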



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565914#comment-17565914
 ] 

Michael Sokolov commented on LUCENE-10577:
--

It would be nice if we could make this encoding an *entirely* internal detail, 
with no user-configuration, but I don't think we can, because:
 # the choice of quantization scaling factor has a significant impact on the 
lossiness and thus recall. It needs to be tuned for each dataset
 # Even if we were able to do this tuning in Lucene, automatically, we would 
have to do it per-segment, and then when we merge, we'd have to re-scale, and 
we would lose more precision then.

Because of this, I think we need to expose the ability for users to provide 
quantized data, and then they need some way of specifying, for a given field, 
whether it is byte-encoded or float-encoded. Although I do see that it could 
be done using the PerFieldKnnVectorsFormat - is that what you were saying, 
[~julietibs]?

 

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565468#comment-17565468
 ] 

Michael Sokolov commented on LUCENE-10471:
--

We should not be imposing an arbitrary limit that prevents people with CNNs 
(image-processing models) from using this feature. It makes sense to me to 
increase the limit to the point where we would see actual bugs/failures, or 
where the large numbers might prevent us from making some future optimization, 
rather than trying to determine where the performance stops being acceptable - 
that's a question for users to decide for themselves. Of course we don't know 
where that place is that we might want to optimize in the future (Rob and I 
discussed an idea using all-integer math that would suffer from overflow), but 
still we should not just allow MAX_INT dimensions, I think. To me a limit like 
16K makes sense – well beyond any stated use case, but not effectively infinite?

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-09 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564577#comment-17564577
 ] 

Michael Sokolov commented on LUCENE-10577:
--

> Please do this without adding additional similarity function named 
>DOT_PRODUCT8 which is really just {*}different encoding{*}. Thanks.

I would like to find a way forward here, but after digging around in our 
codecs, I have to admit I don't really understand what you're getting at, 
@rcmuir. Can you be more explicit?

I tried looking at how DocValues handle this, since there is only one Codec 
and {*}one DocValuesFormat{*}, which to my mind means one codec, yet it 
supports many different DocValues field types. I just don't understand what 
you mean by "scaling out horizontally with more codecs"? Is this about the 
actual file formats and not the Java classes that represent them? I mean, 
honestly, if I look at Lucene90DocValuesConsumer it is exactly the sort of 
"wonder-do-it-all" thing you are calling out. Do you think that should have 
been done differently too?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10151) Add timeout support to IndexSearcher

2022-06-29 Thread Michael Sokolov (Jira)
Michael Sokolov resolved LUCENE-10151 as Fixed

Lucene - Core / LUCENE-10151: Add timeout support to IndexSearcher

Change By: Michael Sokolov
Fix Version/s: 9.3
Resolution: Fixed
Status: Open -> Resolved

--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)



[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-06-29 Thread Michael Sokolov (Jira)
Michael Sokolov commented on LUCENE-10151

Re: Add timeout support to IndexSearcher

Thanks, Deepika Sharma. I've merged this now to main and backported to 9.x. 
One oddity I noticed was a linter failure that happened on 9.x only, but not 
on main? I don't know if we may have relaxed some checks on main? In any case 
I added a patch for both branches, which is this change: 
https://gitbox.apache.org/repos/asf?p=lucene.git;a=commit;h=e078bc1cd9c1e647f963fbdd55cbcd4ec59fac94

--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-06-16 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555170#comment-17555170
 ] 

Michael Sokolov commented on LUCENE-10577:
--

I'm open to doing this with a different API. I tried to avoid massive code 
duplication and extra boilerplate, which is where I think creating yet another 
codec would lead, but I'd be happy to be proven wrong. That's why I tried to 
keep the HNSW util classes well-factored rather than introducing a 
byte-oriented version and a float-oriented version, which I think would be 
nightmarish to maintain since almost all the code would be identical. Kind of 
analogous to the way FST allows you to work with different datatypes. If we 
want to pull the comparison function out into somewhere else, that seems fine, 
but I don't see how that would work: the API [~julietibs] proposed above 
(VectorValues#similarity(float[])) would have to re-convert (the query vector) 
from float[]->byte[] for every document it compares against, wouldn't it?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest -neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554232#comment-17554232
 ] 

Michael Sokolov commented on LUCENE-10612:
--

> Actually, the change I'm proposing is to make it possible to specify the 
> parameters for HNSW without the need to know which HNSW codec is used 
> underneath.

 

I think the idea is that we might choose in the future to use a different 
nearest-neighbor algorithm that would not support the same configuration 
parameters as HNSW. The public-facing API is deliberately not specific to HNSW.

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth
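
For reference, per-field parameters can already be supplied at the codec 
layer; a sketch, assuming the {{(maxConn, beamWidth)}} constructor and the 
per-field hook on the codec (class names follow the 9.x pattern and may not 
match the exact version current at the time):

{code:java}
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene93.Lucene93Codec;
import org.apache.lucene.codecs.lucene93.Lucene93HnswVectorsFormat;

// Sketch: override the per-field KnnVectorsFormat hook to pass HNSW parameters.
public class CustomHnswCodec extends Lucene93Codec {
  private final KnnVectorsFormat format = new Lucene93HnswVectorsFormat(32, 200);

  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    return format;
  }
}
{code}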



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10599) Improve LogMergePolicy's handling of maxMergeSize

2022-06-02 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545487#comment-17545487
 ] 

Michael Sokolov commented on LUCENE-10599:
--

I don't have any deep understanding of the log merge policy, but this grouping 
operation you describe sounds buggy; +1 to improve it to be less jagged.

> Improve LogMergePolicy's handling of maxMergeSize
> -
>
> Key: LUCENE-10599
> URL: https://issues.apache.org/jira/browse/LUCENE-10599
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LogMergePolicy excludes from merging segments whose size is greater than or 
> equal to maxMergeSize. Since a segment whose size is maxMergeSize-1 is still 
> considered for merging, segments will effectively reach a size somewhere 
> between maxMergeSize and mergeFactor*maxMergeSize before they are not 
> considered for merging anymore.
> At least this is what I thought. When LogMergePolicy ignores a segment that 
> is too large for merging, it also ignores other segments that are in the same 
> window of mergeFactor segments for merging if they are on the same tier. So 
> actually segments might reach a size that is somewhere between maxMergeSize / 
> mergeFactor^0.75 and maxMergeSize * mergeFactor before they are not 
> considered for merging anymore.
> Assuming a merge factor of 10 and a max merge size of 1,000 this means that 
> segments will reach their maximum size somewhere between 178 and 10,000. This 
> range is too large and makes maxMergeSize too hard to reason about?
> Specifically, if you have 10 999-docs segments, then LogDocMergePolicy will 
> happily merge them into a single 9990-docs segment. However if you have one 
> 1,000 segment and 9 180-docs segments, then the 180-docs segments will not 
> get merged with any other segment, even if you keep adding segments to the 
> index.
> I propose to change this behavior so that when a large segment is 
> encountered, then we wouldn't skip the entire window of mergeFactor segments, 
> but just the segments that are too large.
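
A back-of-envelope check of those bounds (illustrative only):

{code:java}
public class MergeBounds {
  public static void main(String[] args) {
    double mergeFactor = 10;
    double maxMergeSize = 1000;
    // smallest segment that can get "stuck": maxMergeSize / mergeFactor^0.75
    System.out.println(maxMergeSize / Math.pow(mergeFactor, 0.75)); // ~177.8
    // largest segment still produced: maxMergeSize * mergeFactor
    System.out.println(maxMergeSize * mergeFactor); // 10000.0
  }
}
{code}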



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-25 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542068#comment-17542068
 ] 

Michael Sokolov edited comment on LUCENE-10590 at 5/25/22 2:09 PM:
---

bq. Does the indexing logic rely on tie breaking by node ID? If not, maybe 
index-time graph search could stop as soon as the k-th nearest vector is equal 
to the input vector?

Seems like that could work although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking given that 
is going to be best effort only


was (Author: sokolov):
> Does the indexing logic rely on tie breaking by node ID? If not, maybe 
> index-time graph search could stop as soon as the k-th nearest vector is 
> equal to the input vector?

Seems like that could work although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking given that 
is going to be best effort only

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors where all the vectors were all zeroes. This caused indexing to 
> take a very long time (~40x normal - it did eventually complete) and the 
> search performance was similarly bad.  We should not degrade by orders of 
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-25 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542068#comment-17542068
 ] 

Michael Sokolov commented on LUCENE-10590:
--

> Does the indexing logic rely on tie breaking by node ID? If not, maybe 
> index-time graph search could stop as soon as the k-th nearest vector is 
> equal to the input vector?

Seems like that could work, although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking, given 
that it is going to be best-effort only.

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors where all the vectors were all zeroes. This caused indexing to 
> take a very long time (~40x normal - it did eventually complete) and the 
> search performance was similarly bad.  We should not degrade by orders of 
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541621#comment-17541621
 ] 

Michael Sokolov commented on LUCENE-10577:
--

https://github.com/apache/lucene/pull/924/files is for creating the Lucene93 codec 
with no changes. One question I have about that: how do we create the indexes 
that we check into backward-codecs tests, and do I need to do that?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
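
For illustration, a self-contained sketch of the linear scaling + quantization idea from approach (2) above. The max(abs(min-value), abs(max-value)) scale heuristic is the one named in the description; the class and method names, and the clamping of outliers, are assumptions of this sketch rather than the actual patch:

{code:java}
// Hedged sketch of scaling + quantizing unit-length float vectors to signed
// bytes. Names are hypothetical; only the scale heuristic comes from above.
public class VectorQuantizer {
  /** The scale heuristic from the description: max(abs(min), abs(max)). */
  static float scaleOf(float[] vector) {
    float max = 0f;
    for (float v : vector) {
      max = Math.max(max, Math.abs(v));
    }
    return max;
  }

  /** Maps each float into a signed byte in [-127, 127], clamping outliers. */
  static byte[] quantize(float[] vector, float scale) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      int q = Math.round(vector[i] / scale * 127f);
      out[i] = (byte) Math.max(-127, Math.min(127, q));
    }
    return out;
  }
}
{code}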



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541583#comment-17541583
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Question: should I post one commit adding a new Lucene93 codec and 
Lucene93HnswVectorsFormat etc., and another one actually implementing these 
changes to the format, or smoosh them together into one gigantic change? I'm 
leaning towards separating: creating a new format that is just a clone of 
the existing one, and then following up with the actual changes.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-23 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541193#comment-17541193
 ] 

Michael Sokolov commented on LUCENE-10590:
--

Thanks Julie, this is definitely the same problem. I fiddled around with bounds 
checking, but it's not so obvious how to fix this. I wonder if we can impose a 
default visitedLimit to avoid this kind of runaway explosion; duplicate 
detection sounds challenging.
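
A default visitedLimit would amount to capping the search loop by visited-node count; a minimal sketch of the idea (illustrative only, not the actual HnswGraphSearcher plumbing):

{code:java}
import java.util.ArrayDeque;

// Illustrative only: cap exploration by visited-node count so a degenerate
// (all-equal-score) dataset cannot trigger a full-graph crawl per insertion.
public class VisitedLimitSketch {
  static int exploreWithCap(ArrayDeque<Integer> candidates, int visitedLimit) {
    int visited = 0;
    while (!candidates.isEmpty() && visited < visitedLimit) {
      candidates.pop(); // stand-in for scoring a node and pushing its neighbors
      visited++;
    }
    return visited; // callers could flag results as approximate if the cap was hit
  }
}
{code}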

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors where all the vectors were all zeroes. This caused indexing to 
> take a very long time (~40x normal - it did eventually complete) and the 
> search performance was similarly bad.  We should not degrade by orders of 
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-23 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541180#comment-17541180
 ] 

Michael Sokolov commented on LUCENE-10590:
--

> Love the title, Michael Sokolov. Very Douglas-y Adams-y.
 :starry eyes:

So I wrote a unit test, wrapped the `RandomVectorValues.vectorValue(int)` 
method to see where it was being called, and fiddled around with 
`BoundsChecker` to see what would happen if we swapped its `<` with a `<=`. 
What I found is that in the existing situation we do indeed crawl over the 
entire graph every time we insert a node, because every node looks like a 
viable candidate (we only exclude nodes whose scores are `<` the current least 
score, or `>` for the inverse scoring functions). But if we change to using 
`<=` (resp. `>=`), then the cost shifts over to 
`HnswGraphBuilder.findWorstNonDiverse`, since there we early-terminate in the 
opposite way.

Anyway, that isn't very clear, but the point is that these boundary conditions 
are sensitive to this equality case (where everything is equally distant to 
everything else), and they explode in different directions! Basically what we 
need to do is bias them to give up when stuff is exactly ==. Possibly 
BoundsChecker should get a new parameter (open/closed).
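
To make the open/closed idea concrete, a sketch of what such a parameter could look like; this is an illustration, not the actual Lucene BoundsChecker API:

{code:java}
// Illustrative sketch of an open/closed bound, not Lucene's BoundsChecker.
class Bound {
  private final float bound;
  private final boolean open; // open interval: reject exact ties

  Bound(float bound, boolean open) {
    this.bound = bound;
    this.open = open;
  }

  /** For the maximizing case: is this score still a viable candidate? */
  boolean viable(float score) {
    // With open=true, an all-equal-scores dataset stops exploration, because
    // a score exactly == bound no longer counts as an improvement.
    return open ? score > bound : score >= bound;
  }
}
{code}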

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors where all the vectors were all zeroes. This caused indexing to 
> take a very long time (~40x normal - it did eventually complete) and the 
> search performance was similarly bad.  We should not degrade by orders of 
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-23 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10590:


 Summary: Indexing all zero vectors leads to heat death of the 
universe
 Key: LUCENE-10590
 URL: https://issues.apache.org/jira/browse/LUCENE-10590
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Michael Sokolov


By accident while testing something else, I ran a luceneutil test indexing 1M 
100d vectors where all the vectors were all zeroes. This caused indexing to 
take a very long time (~40x normal - it did eventually complete) and the search 
performance was similarly bad.  We should not degrade by orders of magnitude 
with even the worst data though.

I'm not entirely sure what the issue is, but perhaps as long as we keep finding 
hits that are "better" we keep exploring the graph, where better means (score, 
-docid) >= (lowest score, -docid). If that's right and all docs have the same 
score, then we probably need to either switch to > (but this could lead to 
poorer recall in normal cases) or introduce some kind of minimum score 
threshold?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540284#comment-17540284
 ] 

Michael Sokolov commented on LUCENE-9625:
-

I realized maybe this deserves a better explanation: I didn't use 
multi-threading in the KnnGraphTester that builds the index since my goal at 
the time was really to evaluate whether our algorithm implementation is correct 
and how it performs on a single HNSW graph index. If we use multiple threads, 
this is going to lead to a more fragmented graph due to the way Lucene indexes 
segments, and while this would be a useful point of comparison, it also creates 
a different variable to tune in the benchmark evaluation. If you do want to 
pursue this, I would suggest configuring the IndexWriterConfig with large 
buffers so that each thread creates a single segment, and exposing the number 
of threads/segments as a tunable parameter, since it is going to impact the 
recall and latency reported by the benchmark.
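
For anyone trying that, a sketch of the suggested IndexWriterConfig setup (the 1024 MB buffer is a placeholder to size per experiment):

{code:java}
import org.apache.lucene.index.IndexWriterConfig;

public class SingleSegmentPerThreadConfig {
  // Make the indexing buffer large enough that each indexing thread flushes
  // a single segment, and disable doc-count-based flushing entirely.
  public static IndexWriterConfig newConfig() {
    return new IndexWriterConfig()
        .setRAMBufferSizeMB(1024) // placeholder: large enough to avoid mid-build flushes
        .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
  }
}
{code}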

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540280#comment-17540280
 ] 

Michael Sokolov commented on LUCENE-10574:
--

Very nice! I confess I don't really know how those benchmarks are constructed 
-- do they have "typical" merge policy/update rate/etc or are they deliberately 
exploring this pathological case?

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10504) KnnGraphTester should use KnnVectorQuery

2022-05-19 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov resolved LUCENE-10504.
--
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> KnnGraphTester should use KnnVectorQuery
> 
>
> Key: LUCENE-10504
> URL: https://issues.apache.org/jira/browse/LUCENE-10504
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Assignee: Michael Sokolov
>Priority: Major
> Fix For: 10.0 (main)
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> to get a more realistic picture, and to track developments in the query 
> implementation, the tester should use that rather than implementing its own 
> per-segment search and merging logic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10504) KnnGraphTester should use KnnVectorQuery

2022-05-19 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov reassigned LUCENE-10504:


Assignee: Michael Sokolov

> KnnGraphTester should use KnnVectorQuery
> 
>
> Key: LUCENE-10504
> URL: https://issues.apache.org/jira/browse/LUCENE-10504
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Assignee: Michael Sokolov
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> to get a more realistic picture, and to track developments in the query 
> implementation, the tester should use that rather than implementing its own 
> per-segment search and merging logic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539583#comment-17539583
 ] 

Michael Sokolov commented on LUCENE-10559:
--

I think it makes sense to use a fixed bit set so that we can test HNSW 
performance with filtering independently from the cost of the filter Query. 
Your test seems to demonstrate that for similar latencies (~cost) we can 
achieve significantly higher recall with pre-filtering? I wonder if we could 
also demonstrate the converse -- what effective topK is required when 
post-filtering to drive recall to the same level as pre-filtering?

Also, these recall numbers seem curiously high, higher than we usually see. 
Could you publish the graph construction and HNSW search time parameters you 
used? I'm also wondering whether perhaps you tested with vectors from the 
training set?

> Add preFilter/postFilter options to KnnGraphTester
> --
>
> Key: LUCENE-10559
> URL: https://issues.apache.org/jira/browse/LUCENE-10559
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: 
> if you (say) want the top K nearest neighbors subject to a constraint Q, are 
> you better off over-selecting (say 2K) top hits and *then* filtering 
> (post-filtering), or incorporating the filtering into the query 
> (pre-filtering). How does it depend on the selectivity of the filter?
> I think we can get a reasonable testbed by generating a uniform random filter 
> with some selectivity (that is consistent and repeatable). Possibly we'd also 
> want to try filters that are correlated with index order, but it seems they'd 
> be unlikely to be correlated with vector values in a way that the graph 
> structure would notice, so random is a pretty good starting point for this.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538494#comment-17538494
 ] 

Michael Sokolov commented on LUCENE-10577:
--

> So if you can do stuff directly with ByteVector that would be fine. Also if 
> you can use "poor man's vector" with varhandles and a 64-bit long to operate 
> on the byte values, thats fine too. But please nothing that only works "one 
> at a time".

+1 -- that is what I have done in my prototype (one-at-a-time conversion from 
byte to float), but it is not what we would ship.

By the way, I tried out the attached prototype on some sample data from work 
plus also on Stanford GloVe 200 data and got reasonable results. For the best 
scale value, recall stays within about 1% of baseline. Latency increased a bit 
in some cases (as much as 25%) but decreased in others?!

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538467#comment-17538467
 ] 

Michael Sokolov edited comment on LUCENE-10577 at 5/17/22 8:57 PM:
---

Okay, thanks for the link [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values though?

There do seem to be 8-bit vectorized instructions for Intel chips at least 
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/intrinsics-for-arithmetic-operations-2/mm256-add-epi8-16-32-64.html
 

I agree we should measure, but also the JDK support here seems to be a moving 
target. Perhaps it's time to give it another whirl and see where we are now 
with JDK 18/19.


was (Author: sokolov):
Okay, thanks for the link [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values though?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538467#comment-17538467
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Okay, thanks for the link [~rcmuir]. I do see ByteVector.mul(ByteVector) and so 
on. And yes, [~jpountz], I think that could work for an API. It would be nice to 
let users worry about getting their data into the right shape. I think it might 
make more sense to expect signed values though?

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538457#comment-17538457
 ] 

Michael Sokolov commented on LUCENE-10577:
--

Actually what I have in mind is signed byte values (-128 to 127), not any kind 
of 8-bit floating point. But perhaps your point still holds - I don't know if 
there is hardware support for byte arithmetic?
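
As a baseline for measuring that, a scalar sketch of a signed-byte dot product, accumulating in int so the products cannot overflow; whether the JIT auto-vectorizes this (or the Panama Vector API is needed) is exactly the open question:

{code:java}
public final class ByteDotProduct {
  // Scalar signed-byte dot product; bytes widen to int before multiplying,
  // and 127 * 127 * dimension stays well inside the int range for any
  // realistic vector dimension.
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }
}
{code}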

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10577) Quantize vector values

2022-05-17 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10577:


 Summary: Quantize vector values
 Key: LUCENE-10577
 URL: https://issues.apache.org/jira/browse/LUCENE-10577
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Michael Sokolov


The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
These fields can be used (via {{KnnVectorsReader}}) in two main ways:

1. The {{VectorValues}} iterator enables retrieving values
2. Approximate nearest-neighbor search

The main point of this addition was to provide the search capability, and to 
support that it is not really necessary to store vectors in full precision. 
Perhaps users may also be willing to retrieve values in lower precision for 
whatever purpose those serve, if they are able to store more samples. We know 
that 8 bits is enough to provide a very near approximation to the same 
recall/performance tradeoff that is achieved with the full-precision vectors. 
I'd like to explore how we could enable 4:1 compression of these fields by 
reducing their precision.

A few ways I can imagine this would be done:

1. Provide a parallel byte-oriented API. This would allow users to provide 
their data in reduced-precision format and give control over the quantization 
to them. It would have a major impact on the Lucene API surface though, 
essentially requiring us to duplicate all of the vector APIs.
2. Automatically quantize the stored vector data when we can. This would 
require no or perhaps very limited change to the existing API to enable the 
feature.

I've been exploring (2), and what I find is that we can achieve very good 
recall results using dot-product similarity scoring by simple linear scaling + 
quantization of the vector values, so long as we choose the scale that 
minimizes the quantization error. Dot-product is amenable to this treatment 
since vectors are required to be unit-length when used with that similarity 
function. 

 Even still there is variability in the ideal scale over different data sets. A 
good choice seems to be max(abs(min-value), abs(max-value)), but of course this 
assumes that the data set doesn't have a few outlier data points. A theoretical 
range can be obtained by 1/sqrt(dimension), but this is only useful when the 
samples are normally distributed. We could in theory determine the ideal scale 
when flushing a segment and manage this quantization per-segment, but then 
numerical error could creep in when merging.

I'll post a patch/PR with an experimental setup I've been using for evaluation 
purposes. It is pretty self-contained and simple, but has some drawbacks that 
need to be addressed:

1. No automated mechanism for determining quantization scale (it's a constant 
that I have been playing with)
2. Converts from byte/float when computing dot-product instead of directly 
computing on byte values

I'd like to get people's feedback on the approach and whether in general we 
should think about doing this compression under the hood, or expose a 
byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538395#comment-17538395
 ] 

Michael Sokolov commented on LUCENE-9625:
-

There's no support for using an existing index; creating the index is an 
important part of the benchmark, I think? As for threading, no, it would be 
necessary to modify the test harness. But maybe you should consider 
contributing to ann-benchmarks?

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-17 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538389#comment-17538389
 ] 

Michael Sokolov commented on LUCENE-10574:
--

I'm not sure if I understand, but are we seeing O(N^2) because tiny segments 
get merged into small segments, which get merged into smallish segments, and so 
on, and because the original segments were so tiny we end up merging the same 
document(s) many times?

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-05-08 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533591#comment-17533591
 ] 

Michael Sokolov commented on LUCENE-10397:
--

I think this may have been broken by LUCENE-10384, which removed support for 
max-heap from LongHeap and mapped scores into a consistent ordering, but 
neglected the need to preserve docid ordering.
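
For context, the ordering in question packs a score and a node id into a single long for LongHeap; a hedged sketch of an encoding that preserves the docid tie-break (illustrative, not the exact NeighborQueue code):

{code:java}
import org.apache.lucene.util.NumericUtils;

public class ScoreDocEncoding {
  // Illustrative only, not the exact Lucene code: pack sortable score bits
  // into the high 32 bits and the bitwise-inverted docid into the low 32,
  // so that ties on score still order deterministically by docid (the
  // smaller docid maps to the larger long). Assumes non-negative scores.
  static long encode(float score, int docid) {
    long scoreBits = NumericUtils.floatToSortableInt(score);
    return (scoreBits << 32) | (0xFFFFFFFFL & ~docid);
  }
}
{code}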

> KnnVectorQuery doesn't tie break by doc ID
> --
>
> Key: LUCENE-10397
> URL: https://issues.apache.org/jira/browse/LUCENE-10397
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was expecting KnnVectorQuery to tie-break by doc ID so that if multiple 
> documents get the same score then the ones that have the lowest doc ID would 
> get returned first, similarly to how SortField.SCORE also tie-breaks by doc 
> ID.
> However the following test fails, suggesting that it is not the case.
> {code:java}
>   public void testTieBreak() throws IOException {
> try (Directory d = newDirectory()) {
>   try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
> for (int j = 0; j < 5; j++) {
>   Document doc = new Document();
>   doc.add(
>   new KnnVectorField("field", new float[] {0, 1}, 
> VectorSimilarityFunction.DOT_PRODUCT));
>   w.addDocument(doc);
> }
>   }
>   try (IndexReader reader = DirectoryReader.open(d)) {
> assertEquals(1, reader.leaves().size());
> IndexSearcher searcher = new IndexSearcher(reader);
> KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 
> 3}, 3);
> TopDocs topHits = searcher.search(query, 3);
> assertEquals(0, topHits.scoreDocs[0].doc);
> assertEquals(1, topHits.scoreDocs[1].doc);
> assertEquals(2, topHits.scoreDocs[2].doc);
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10558) Expose IOSupplier constructors in Kuromoji (and Nori?)

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531971#comment-17531971
 ] 

Michael Sokolov commented on LUCENE-10558:
--

> As a workaround I would suggest putting the files into the filesystem and 
> referring to them by path. For code like this it is better in my opinion.

I hear you, it seems like a clean solution, but we are operating in an 
environment that already exists, where we get these resources via another 
package on our class path. Writing them to a local directory seems conceivable, 
but handling them as streams is much easier and requires less cleanup and 
management.
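
As a sketch of what handling these resources as streams could look like once stream-accepting constructors are exposed: the IOSupplier shape below is an assumption about the eventual API, and MyAnalyzer and DICTIONARY_PATH stand in for the caller's own classpath layout:

{code:java}
// Hypothetical caller-side sketch; the actual constructor signatures are
// whatever this issue ends up exposing.
IOSupplier<InputStream> costsSupplier =
    () -> MyAnalyzer.class.getResourceAsStream(DICTIONARY_PATH + "/ConnectionCosts");
ConnectionCosts connectionCosts = new ConnectionCosts(costsSupplier);
{code}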

> Expose IOSupplier constructors in Kuromoji (and Nori?)
> ---
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10558) Expose IOSupplier constructors in Kuromoji (and Nori?)

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531960#comment-17531960
 ] 

Michael Sokolov edited comment on LUCENE-10558 at 5/4/22 9:53 PM:
--

Hi [~uschindler] – we have code like this:

{code:java}
connectionCosts = new ConnectionCosts(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/ConnectionCosts");
systemDictionary = new TokenInfoDictionary(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/TokenInfoDictionary");
unknownDictionary = new UnknownDictionary(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/UnknownDictionary");

japaneseTokenizer = new JapaneseTokenizer(attributeFactory, systemDictionary,
    unknownDictionary, connectionCosts, userDict,
    getDiscardPunctuationSetting(context), false, mode);
{code}

Those resource constructors started ignoring the paths we passed in 9.1.


was (Author: sokolov):
Hi [~uschindler] – we have code like this:

 

 

{{ connectionCosts = new ConnectionCosts(ResourceScheme.CLASSPATH, 
DICTIONARY_PATH + "/ConnectionCosts");}}
{{ systemDictionary = new TokenInfoDictionary(ResourceScheme.CLASSPATH, 
DICTIONARY_PATH + "/TokenInfoDictionary");}}
{{ unknownDictionary = new UnknownDictionary(ResourceScheme.CLASSPATH, 
DICTIONARY_PATH + "/UnknownDictionary");}}

{{ japaneseTokenizer = new JapaneseTokenizer(attributeFactory, 
systemDictionary,}}
{{                unknownDictionary, connectionCosts, userDict, 
getDiscardPunctuationSetting(context),}}
{{                false, mode);}}
  

 

It stops using those custom resources in 9.1

> Expose IOSupplier constructors in Kuromoji (and Nori?)
> ---
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>
> When we refactored the constructors for these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10558) Expose IOSupplier constructors in Kuromoji (and Nori?)

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531960#comment-17531960
 ] 

Michael Sokolov edited comment on LUCENE-10558 at 5/4/22 9:52 PM:
--

Hi [~uschindler] – we have code like this:

{code:java}
connectionCosts = new ConnectionCosts(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/ConnectionCosts");
systemDictionary = new TokenInfoDictionary(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/TokenInfoDictionary");
unknownDictionary = new UnknownDictionary(ResourceScheme.CLASSPATH,
    DICTIONARY_PATH + "/UnknownDictionary");

japaneseTokenizer = new JapaneseTokenizer(attributeFactory, systemDictionary,
    unknownDictionary, connectionCosts, userDict,
    getDiscardPunctuationSetting(context), false, mode);
{code}

It stops using those custom resources in 9.1.


was (Author: sokolov):
Hi [~uschindler] – we have code like this:

 

{code:java}
connectionCosts = new ConnectionCosts(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/ConnectionCosts");
systemDictionary = new TokenInfoDictionary(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/TokenInfoDictionary");
unknownDictionary = new UnknownDictionary(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/UnknownDictionary");

japaneseTokenizer = new JapaneseTokenizer(attributeFactory, systemDictionary,
    unknownDictionary, connectionCosts, userDict, getDiscardPunctuationSetting(context),
    false, mode);
{code}

 

It stops using those custom resources in 9.1

> Expose IOSupplier constructors in Kuromoji (and Nori?)
> ---
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resource streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10558) Expose IOSupplier constructors in Kuromoji (and Nori?)

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531960#comment-17531960
 ] 

Michael Sokolov commented on LUCENE-10558:
--

Hi [~uschindler] – we have code like this:

 

{code:java}
connectionCosts = new ConnectionCosts(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/ConnectionCosts");
systemDictionary = new TokenInfoDictionary(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/TokenInfoDictionary");
unknownDictionary = new UnknownDictionary(ResourceScheme.CLASSPATH, DICTIONARY_PATH + "/UnknownDictionary");

japaneseTokenizer = new JapaneseTokenizer(attributeFactory, systemDictionary,
    unknownDictionary, connectionCosts, userDict, getDiscardPunctuationSetting(context),
    false, mode);
{code}

 

It stops using those custom resources in 9.1
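
For reference, a rough sketch of what exposing those constructors would let us 
write instead; the single-argument IOSupplier form below is a simplification 
for illustration (the real private constructors may take several suppliers), 
and MyTokenizerFactory is a placeholder for the caller's own class:

{code:java}
// The caller resolves its classpath resources itself and hands Lucene
// suppliers of the streams; no path resolution happens inside Lucene.
connectionCosts = new ConnectionCosts(
    () -> MyTokenizerFactory.class.getResourceAsStream(
        DICTIONARY_PATH + "/ConnectionCosts"));
systemDictionary = new TokenInfoDictionary(
    () -> MyTokenizerFactory.class.getResourceAsStream(
        DICTIONARY_PATH + "/TokenInfoDictionary"));
{code}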

> Expose IOSupplier constructors in Kuromoji (and Nori?)
> ---
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resource streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-04 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10559:


 Summary: Add preFilter/postFilter options to KnnGraphTester
 Key: LUCENE-10559
 URL: https://issues.apache.org/jira/browse/LUCENE-10559
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael Sokolov


We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: if 
you (say) want the top K nearest neighbors subject to a constraint Q, are you 
better off over-selecting (say 2K) top hits and *then* filtering 
(post-filtering), or incorporating the filtering into the query 
(pre-filtering)? How does it depend on the selectivity of the filter?

I think we can get a reasonable testbed by generating a uniform random filter 
with some selectivity (that is consistent and repeatable). Possibly we'd also 
want to try filters that are correlated with index order, but it seems they'd 
be unlikely to be correlated with vector values in a way that the graph 
structure would notice, so random is a pretty good starting point for this.
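
For concreteness, a minimal sketch of such a filter; the helper below is not 
existing KnnGraphTester code, just an illustration, and the fixed seed is what 
makes it consistent and repeatable across runs:

{code:java}
import java.util.Random;
import org.apache.lucene.util.FixedBitSet;

// Build a uniform random filter over maxDoc docs with the given selectivity;
// reusing the same seed reproduces exactly the same filter.
static FixedBitSet randomFilter(int maxDoc, double selectivity, long seed) {
  Random random = new Random(seed);
  FixedBitSet bits = new FixedBitSet(maxDoc);
  for (int doc = 0; doc < maxDoc; doc++) {
    if (random.nextDouble() < selectivity) {
      bits.set(doc);
    }
  }
  return bits;
}
{code}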



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10558) Expose IOSupplier constructors in Kuromoji (and Nori?)

2022-05-04 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10558:


 Summary: Expose IOSupplier constructors in Kuromoji 
(and Nori?)
 Key: LUCENE-10558
 URL: https://issues.apache.org/jira/browse/LUCENE-10558
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael Sokolov


When we refactored the constructors for  these resource objects used by the 
kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the behavior 
for consumers that were supplying these resources on the classpath. In that 
case, we silently replaced the custom resources with the Lucene built-in ones.  
I think we cannot support the old API because of Java Module system 
restrictions, but we didn't provide any usable replacement or notice either.

 

This issue is for exposing the new (private) constructors that accept streams, 
and adding a notice to Migration.md to point users at them, since they can be 
used with resource streams loaded from the classpath by the caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10335) IOUtils.getDecodingReader(Class, String) is broken with modules

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531851#comment-17531851
 ] 

Michael Sokolov commented on LUCENE-10335:
--

Thanks, Uwe, you're right, I confused the issues. Maybe it was LUCENE-10400 that 
overrode the old behavior? And you are right, I am currently running into an 
issue with Kuromoji, not any other analyzer. Thank you for pointing out that we 
already have an InputStream constructor. I think we just need to make it public 
to support the use case we broke.

Regarding migrate.txt, did you mean Migrate.md? Anyway I don't think there is 
any mention there of the change in behavior to ConnectionCosts and 
TokenInfoDictionary  constructors.

 

> IOUtils.getDecodingReader(Class, String) is broken with modules
> --
>
> Key: LUCENE-10335
> URL: https://issues.apache.org/jira/browse/LUCENE-10335
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main)
>
> Attachments: LUCENE-10335-1.patch, LUCENE-10335.patch, Screenshot 
> from 2021-12-25 18-04-55.png
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>
> This method calls clazz.getResourceAsStream() but in a modular application 
> the method won't see any of the resources in clazz's module, causing an NPE. 
> We should deprecate or even remove this method entirely, leaving only 
> getDecodingReader(InputStream) and opening the resource on the caller's side.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10335) IOUtils.getDecodingReader(Class, String) is broken with modules

2022-05-04 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531809#comment-17531809
 ] 

Michael Sokolov commented on LUCENE-10335:
--

This issue broke backwards compatibility. I think we really ought to have 
mentioned it in CHANGES.txt. Specifically, what happened was that previously:

{code:java}
new ConnectionCosts(BinaryDictionary.ResourceScheme.CLASSPATH, path)
{code}

would load the resource from the given path, but now the path is ignored and a 
default path is substituted in its place. I think the same is true for the 
other external resource files.

I guess we can add a CHANGES entry now to warn people since anyone that loads 
these resources from the classpath will have trouble when upgrading, since 
their dictionaries will be silently overridden with Lucene's built-in ones, 
causing subtle (or not so subtle) changes in tokenization. I can add an entry 
there, but first I'm trying to understand what options we have.

If the classpath mechanism we used before is no longer feasible because of Java 
module system rules, how can we load our resources that are shipped in jars?  
Maybe we could provide a constructor accepting InputStreams so the caller can 
open up the classpath resources?
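
To make the caller-side idea concrete, a minimal sketch (the resource name is 
made up; IOUtils.getDecodingReader(InputStream, Charset) is the overload that 
survives):

{code:java}
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import org.apache.lucene.util.IOUtils;

// The caller opens the resource, so module visibility is the caller's
// concern rather than Lucene's.
InputStream stream =
    MyAnalyzer.class.getResourceAsStream("my-resource.txt"); // hypothetical
try (Reader reader = IOUtils.getDecodingReader(stream, StandardCharsets.UTF_8)) {
  // ... parse the resource and hand the result to Lucene ...
}
{code}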

> IOUtils.getDecodingReader(Class, String) is broken with modules
> --
>
> Key: LUCENE-10335
> URL: https://issues.apache.org/jira/browse/LUCENE-10335
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main)
>
> Attachments: LUCENE-10335-1.patch, LUCENE-10335.patch, Screenshot 
> from 2021-12-25 18-04-55.png
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>
> This method calls clazz.getResourceAsStream() but in a modular application 
> the method won't see any of the resources in clazz's module, causing an NPE. 
> We should deprecate or even remove this method entirely, leaving only 
> getDecodingReader(InputStream) and opening the resource on the caller's side.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10552) KnnVectorQuery has incorrect equals/ hashCode

2022-05-02 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530801#comment-17530801
 ] 

Michael Sokolov commented on LUCENE-10552:
--

Yes, I think so too. Nice catch - please feel free to submit a patch
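
Roughly, the fix could look like this (a sketch only; the field names follow 
the 9.x sources, and sameClassAs/classHash are the usual Query helpers):

{code:java}
@Override
public boolean equals(Object o) {
  if (sameClassAs(o) == false) {
    return false;
  }
  KnnVectorQuery that = (KnnVectorQuery) o;
  return k == that.k
      && field.equals(that.field)
      && Arrays.equals(target, that.target)
      && Objects.equals(filter, that.filter); // include the filter clause
}

@Override
public int hashCode() {
  return Objects.hash(classHash(), field, k, Arrays.hashCode(target), filter);
}
{code}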

> KnnVectorQuery has incorrect equals/ hashCode
> -
>
> Key: LUCENE-10552
> URL: https://issues.apache.org/jira/browse/LUCENE-10552
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> filter query could impact document-filtering properties, so maybe it should 
> be a condition in Query#equals and Query#hashCode ?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-29 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530144#comment-17530144
 ] 

Michael Sokolov commented on LUCENE-10541:
--

I think the probability of choosing a particular item is based on the length of 
the item just prior (ignoring the skip points; I don't understand what Mike is 
referring to there!). But if there is no correlation between length(line N) and 
length(line N+1) we can probably ignore that. In other words, the item 
following the longest line L is the most likely item to be chosen, but its 
expected length is no different from the expected length of all the lines, 
right? In which case I don't think the seek-and-scan method changes the 
probabilities at all. So I think we can simply count the number of lines of a 
given length (or above some threshold) and divide by the total number of lines 
to get P(line length).
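
In symbols, assuming uncorrelated line lengths as above: with N total lines 
and N_t the number of lines at or above the length threshold t,

{noformat}
P(picking a line with length >= t) ~= N_t / N
{noformat}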

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
> terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10519) ThreadLocal.remove under G1GC takes 100% CPU

2022-04-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528375#comment-17528375
 ] 

Michael Sokolov commented on LUCENE-10519:
--

I looked at the patch. It's hard to evaluate in isolation. It would be helpful 
to see some test results demonstrating the benefit, and no degradation to 
existing use cases. I think it's worth looking at how we use this object if we 
want to improve it. I found only two uses; one in SegmentCoreReaders, where we 
have two of them (I guess it's two per segment?) and then some similar usage in 
MergingCodecReader. I wonder if instead of working around this issue in GC, we 
could more explicitly null out these pointers to the various format readers 
when the enclosing IndexReader is closed?
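
To illustrate, purely as a sketch (the field names here are made up, and 
SegmentCoreReaders' real cleanup is more involved):

{code:java}
// Instead of waiting for GC to notice the WeakReferences, release the cached
// per-thread readers explicitly when the last reference goes away.
private void decRef() throws IOException {
  if (ref.decrementAndGet() == 0) {
    IOUtils.close(storedFieldsLocal, termVectorsLocal); // CloseableThreadLocal.close()
    storedFieldsLocal = null; // drop the pointers so the per-thread maps die
    termVectorsLocal = null;
  }
}
{code}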

> ThreadLocal.remove under G1GC takes 100% CPU
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
>Reporter: Lucifer Boice
>Priority: Critical
>  Labels: CloseableThreadLocal
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> h2. Problem
> 
> {*}org.apache.lucene.util.CloseableThreadLocal{*}(which is using 
> {*}ThreadLocal>{*}) may still have a flaw under G1GC. There 
> is a single ThreadLocalMap stored for each thread, which all ThreadLocals 
> share, and that master map only periodically purges stale entries. When we 
> close a CloseableThreadLocal, we only take care of the current thread right 
> now, others will be taken care of via the WeakReferences. Under G1GC, the 
> WeakReferences of other threads may not be recycled even after several rounds 
> of mix-GC. The ThreadLocalMap may grow very large, it can take an arbitrarily 
> long amount of CPU and time to iterate the things you had stored in it.
> Hot thread of elasticsearch:
> {code:java}
> ::: 
> {x}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{}{}
>Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, 
> ignoreIdleThreads=true:
>
>105.3% (526.5ms out of 500ms) cpu usage by thread 
> 'elasticsearch[][bulk][T#31]'
>  10/10 snapshots sharing following 34 elements
>
> java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
>java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
>java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
>java.lang.ThreadLocal.remove(ThreadLocal.java:224)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
>
> org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
>
> org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
>
> org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
>org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
>  {code}
> h2. Solution
> 
> This bug does not reproduce under CMS. It can be reproduced under G1GC always.
> In fact, *CloseableThreadLocal* doesn't need to store entry twice in the 
> hardRefs And ThreadLocals. Remove ThreadLocal from CloseableThreadLocal so 
> that we would not be affected by the serious flaw of Java's built-in 
> ThreadLocal. 
> h2. See also
> 
> [https://github.com/elastic/elasticsearch/issues/56766]
> [https://bugs.openjdk.java.net/browse/JDK-8182982]
> [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-04-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528362#comment-17528362
 ] 

Michael Sokolov commented on LUCENE-10527:
--

Yes, good find! Agree that users may have an expectation, and there is some 
improvement, too - let's do it
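
The change itself is tiny; a minimal sketch of the layer-dependent choice (the 
helper name is made up):

{code:java}
// Per the paper: M links on the upper layers, 2*M on the bottom layer
// (level 0), which holds the full neighborhood graph.
static int maxConnOnLevel(int level, int M) {
  return level == 0 ? 2 * M : M;
}
{code}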

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
> Attachments: image-2022-04-20-14-53-58-484.png
>
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> num_candsApproach  Recall 
> QPS
> 10   luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.563  
>4410.499
> 50   luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.798  
>1956.280
> 100  luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.862  
>1209.734
> 100  luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.958  
> 341.428
> 800  luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.974  
> 230.396
> 1000 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.980  
> 188.757
> 10   hnswlib ({'M': 32, 'efConstruction': 100})0.552  
>   16745.433
> 50   hnswlib ({'M': 32, 'efConstruction': 100})0.794  
>5738.468
> 100  hnswlib ({'M': 32, 'efConstruction': 100})0.860  
>3336.386
> 500  hnswlib ({'M': 32, 'efConstruction': 100})0.956  
> 832.982
> 800  hnswlib ({'M': 32, 'efConstruction': 100})0.973  
> 541.097
> 1000 hnswlib ({'M': 32, 'efConstruction': 100})0.979  
> 442.163
> {code}
> I think it'd be nice update to maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8580) Make segment merging parallel in SegmentMerger

2022-04-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528360#comment-17528360
 ] 

Michael Sokolov commented on LUCENE-8580:
-

It seems we have generally stuck with one or a handful of files per format; 
presumably we avoid fragmenting into many small files so that the OS can 
handle I/O more optimally? I think the key is not to impact search 
performance. But maybe start by writing many files in parallel just to see what 
is the best we can achieve for parallel merging with this approach and worry 
about connecting them together later? It shouldn't be hard to do the stitching, 
just block copies + rewriting the metadata?
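
Something along these lines for the experiment (a sketch with hypothetical 
per-format subroutines and an externally supplied ExecutorService):

{code:java}
// Run the independent per-format merges concurrently and wait for all of
// them; each task writes its own file(s), stitching comes later.
List<Callable<Void>> tasks = List.of(
    () -> { mergeStoredFields(); return null; },
    () -> { mergeNorms(); return null; },
    () -> { mergePostings(); return null; },
    () -> { mergeDocValues(); return null; },
    () -> { mergePoints(); return null; });
for (Future<Void> future : executor.invokeAll(tasks)) {
  future.get(); // propagates any per-format merge failure
}
{code}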

> Make segment merging parallel in SegmentMerger
> --
>
> Key: LUCENE-8580
> URL: https://issues.apache.org/jira/browse/LUCENE-8580
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not 
> of any high priority.
> At the moment any merging from N segments into one will happen sequentially 
> for each data structure involved in a segment (postings, norms, points, 
> etc.). If the input segments are large, the CPU (and I/O) are mostly unused 
> and the process takes a long time. 
> Merging of these data structures is mostly independent of each other, so it'd 
> be interesting to see if we can speed things up by allowing them to run 
> concurrently. I investigated this on a 40GB index with 22 segments, 
> force-merging this into 1 segment (of similar size). Quick and dirty patch 
> attached.
> I see some improvement, although it's not by much; the largest component 
> dominates everything else.
> Results from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields 
> [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings 
> [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 
> docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 
> docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec 
> to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge 
> store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 
> msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 
> msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time  950249 msec for 41922110 
> docs
> {code}
> Ideally, one would need to push forkjoin into individual subroutines so that, 
> for example, postings utilize concurrency when merging (pulling blocks of 
> terms concurrently from the input, calculating statistics, etc. and then 
> pushing in an ordered fashion to the codec). 
> [1] https://markmail.org/thread/dtejwq42qagykeac



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores

2022-04-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17520560#comment-17520560
 ] 

Michael Sokolov commented on LUCENE-9269:
-

# Does the checkBoosts test case you refer to fail when you attempt your 
change? If so, please address it. Otherwise, I think it should be fixed 
separately.
 # It's OK for two different queries to behave the same, and I don't see how 
you can know that they will in this case, so they should compare as different, 
I think.
 # Again, toString() is not guaranteed to be different for different queries; I 
think it's OK.

> Blended queries with boolean rewrite can result in inconsistent scores
> --
>
> Key: LUCENE-9269
> URL: https://issues.apache.org/jira/browse/LUCENE-9269
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Michele Palmia
>Priority: Minor
> Attachments: LUCENE-9269-test.patch
>
>
> If two blended queries are should clauses of a boolean query and are built so 
> that
>  * some of their terms are the same
>  * their rewrite method is BlendedTermQuery.BOOLEAN_REWRITE
> the docFreq for the overlapping terms used for scoring is picked as follow:
>  # if the overlapping terms are not boosted, the df of the term in the first 
> blended query is used
>  # if any of the overlapping terms is boosted, the df is picked at (what 
> looks like) random.
> A few examples using a field with 2 terms: f:a (df: 2), and f:b (df: 3).
> {code:java}
> a)
> Blended(f:a f:b) Blended (f:a)
> df: 3 df: 2
> gets rewritten to:
> (f:a)^2.0 (f:b)
> df: 3  df:2
> b)
> Blended(f:a) Blended(f:a f:b)
> df: 2df: 3
> gets rewritten to:
> (f:a)^2.0 (f:b)
>  df: 2 df:2
> c)
> Blended(f:a f:b^0.66) Blended (f:a^0.75)
> df: 3  df: 2
> gets rewritten to:
> (f:a)^1.75 (f:b)^0.66
>  df:?   df:2
> {code}
> with ? either 2 or 3, depending on the run.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-07 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519219#comment-17519219
 ] 

Michael Sokolov commented on LUCENE-10292:
--

It's surprising it took so long for someone to stumble over this. Maybe because 
most (as I used to do) rebuild their suggesters offline as part of a batch 
build process? Anyway I looked over the patch and didn't see any problem except 
a typo in the tests: "testDurringReBuild" should probably be "testDuringRebuild"? 
Thanks for the nice tests!

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this is works because the {{build()}} method uses temporary 
> datastructures until it's "build logic" is complete, at which point it 
> atomically replaces the datastructures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method 
> starts by closing & null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only populates again once it's completed building up 
> it's index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10504) KnnGraphTester should use KnnVectorQuery

2022-04-06 Thread Michael Sokolov (Jira)
Michael Sokolov created LUCENE-10504:


 Summary: KnnGraphTester should use KnnVectorQuery
 Key: LUCENE-10504
 URL: https://issues.apache.org/jira/browse/LUCENE-10504
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael Sokolov


To get a more realistic picture, and to track developments in the query 
implementation, the tester should use that rather than implementing its own 
per-segment search and merging logic.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-04-05 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517495#comment-17517495
 ] 

Michael Sokolov commented on LUCENE-9625:
-

So, clearly it's not finding KnnGraphTester. Probably it needs to go in the 
proper folder structure (org/apache/lucene/util/KnnGraphTester.class) relative 
to the classpath root. TBH I don't remember the way files are set up in this 
ann-benchmarks environment, but I hope that helps you figure it out!

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-04-05 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517471#comment-17517471
 ] 

Michael Sokolov commented on LUCENE-9625:
-

> You've mentioned that jars should be copied, but it's not very clear which 
> Lucene jars we have to copy?

I think lucene-core should be enough. What have you tried?

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10425) count aggregation optimization inside one segment in log scenario

2022-03-07 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502254#comment-17502254
 ] 

Michael Sokolov commented on LUCENE-10425:
--

> Thinking through this a bit more, I imagine that this is the reason why you 
> made this API optional.

Once you implement this API, if it is indeed useful, someone starts relying on 
it. Then you can no longer just go ahead and make a new implementation that 
doesn't support it (or makes it much slower), at least not without a fight: 
it's harder to remove something than not to add it in the first place. I don't 
think the optional idea is real.

We need to make a case for this on the merits - if it enables powerful 
optimizations across a wide range of applications, then so be it, perhaps it's 
worth the future constraints. I'm not sure though

> count aggregation optimization inside one segment in log scenario
> -
>
> Key: LUCENE-10425
> URL: https://issues.apache.org/jira/browse/LUCENE-10425
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Reporter: jianping weng
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In a log scenario, we usually want to know the doc count of documents in 
> each time interval. One possible optimized method is to sort the documents in 
> ascending order according to the @timestamp field within one segment. Then we 
> can use this pr [https://github.com/apache/lucene/pull/687] to find the 
> min/max docId in one time interval.
> If there is no other filter query, the doc count of one time interval is (max 
> docId - min docId + 1).
> If there is only one other term filter query, we can use this pr 
> [https://github.com/apache/lucene/pull/688] to get the diff value of the 
> index when we call advance(minId) and advance(maxId); the diff value is also 
> the doc count of one time interval.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492056#comment-17492056
 ] 

Michael Sokolov commented on LUCENE-10421:
--

Yeah, there is really no value I can see in varying this random seed. +1 to 
introduce a constant
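
The fix could be as simple as this (the value is arbitrary, it just must not 
be time-dependent):

{code:java}
public final class HnswGraphBuilder {
  /** Default random seed for level generation: fixed, so that graph
   *  construction (and benchmark results) is deterministic. */
  private static final long DEFAULT_RAND_SEED = 42; // any constant works
  // ...
}
{code}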

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492028#comment-17492028
 ] 

Michael Sokolov commented on LUCENE-10421:
--

(seems to depend on the query's toString()):

this was the old KnnQuery.toString():
{code:java}
@Override
public String toString(String field) {
  return "[" + vector[0] + ",...]>";
}
{code}

Mike Sokolov  9:34 AM
yah OK

Michael McCandless  9:34 AM
aha!  the new one:
{code:java}
@Override
public String toString(String field) {
  return getClass().getSimpleName() + ":" + this.field + "[" + target[0] + ",...][" + k + "]";
}
{code}

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492026#comment-17492026
 ] 

Michael Sokolov commented on LUCENE-10421:
--

[https://github.com/mikemccand/luceneutil/commit/6fee2290] was added to deal 
with this a while back

 

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492022#comment-17492022
 ] 

Michael Sokolov commented on LUCENE-10421:
--

 I think we have always had this time-dependent random seed in the HNSW 
indexing code, so I'm curious how benchmarks were working before?

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10177) Rename VectorValues#dimension to VectorValues#getNumDimensions?

2022-02-10 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490196#comment-17490196
 ] 

Michael Sokolov commented on LUCENE-10177:
--

Heh, I prefer {{dimension()}} and would probably do the rename in the other 
direction, but I won't block this.

> Rename VectorValues#dimension to VectorValues#getNumDimensions?
> ---
>
> Key: LUCENE-10177
> URL: https://issues.apache.org/jira/browse/LUCENE-10177
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> This would make it consistent with PointValues#getNumDimensions.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-25 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481836#comment-17481836
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> +1 on figuring out a better execution path before the release, it's going to 
> look bad if setting a filter could make the query perform many times slower 
> than a linear scan

This is fair, but by "release" do we mean – commit to lucene/main? I think so, 
because a 9.1 or later release could be cut from that at any time. So ... let's 
do the work on a feature branch to enable iterating to get it nailed down.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479649#comment-17479649
 ] 

Michael Sokolov edited comment on LUCENE-10382 at 1/20/22, 8:20 PM:


> I'm a little fuzzy on the cost computation being discussed. Is this about the 
> decision to do the ANN or fully materialized KNN?

Yes. I wouldn't worry about that at first though. Maybe we can do three steps 
something like this:
 # implement Query-based filter, always using HNSW search that we have today. 
It would have to be marked with some serious caveats about potential 
performance risk, but we should make progress somehow without insisting on the 
full implementation at once. Perhaps we can just document the risk, mark as 
experimental in javadoc?
 # implement full KNN fallback with a fixed cutoff (based on Query cost?)
 # implement an adaptive cost computation

also, maybe we're overthinking 3 and it's not really needed/simpler than we 
think?


was (Author: sokolov):
> I'm a little fuzzy on the cost computation being discussed. Is this about the 
> decision to do the ANN or fully materialized KNN?

Yes. I wouldn't worry about that at first though. Maybe we can do three steps 
something like this:
 # implement Query-based filter, always using HNSW search that we have today. 
It would have to be marked with some serious caveats about potential 
performance risk, but we should make progress somehow without insisting on the 
full implementation at once. Perhaps we can just document the risk, mark as 
experimental in javadoc?
 # implement full KNN fallback with a fixed cutoff (based on Query cost?)
 # implement an adaptive cost computation

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479649#comment-17479649
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> I'm a little fuzzy on the cost computation being discussed. Is this about the 
> decision to do the ANN or fully materialized KNN?

Yes. I wouldn't worry about that at first though. Maybe we can do three steps 
something like this:
 # implement Query-based filter, always using HNSW search that we have today. 
It would have to be marked with some serious caveats about potential 
performance risk, but we should make progress somehow without insisting on the 
full implementation at once. Perhaps we can just document the risk, mark as 
experimental in javadoc?
 # implement full KNN fallback with a fixed cutoff (based on Query cost?)
 # implement an adaptive cost computation

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479487#comment-17479487
 ] 

Michael Sokolov commented on LUCENE-10382:
--

If we go with a {{Query}}-based filter, I guess it would still be possible 
to create a query wrapping a {{BitSetProducer}} (like TPBJQ), so it's not as if 
it's a hard decision preventing a customer from providing a precomputed bitset - 
I think?

Also, relying on the cache makes sense to me, but I have some reservations. One 
issue we've found is that because it caches entire Query results, it can often 
miss significant caching opportunities, say when a complex {{BooleanQuery}} has 
a subset of clauses that can profitably be cached. Maybe the Query-writer can 
structure their queries to be more cache-friendly by nesting BQs? But then 
again they get rewritten and may be flattened prior to the cache seeing them.

Anyway, maybe we can enhance the cache, but that's a separate issue; +1 to 
move ahead using Query.
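
For concreteness, a query wrapping a {{BitSetProducer}} could look something 
like this (a sketch with an invented class name, loosely modeled on what TPBJQ 
does internally; not an existing Lucene class):

{code:java}
class BitSetProducerQuery extends Query {
  private final BitSetProducer producer;

  BitSetProducerQuery(BitSetProducer producer) {
    this.producer = producer;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
    return new ConstantScoreWeight(this, boost) {
      @Override
      public Scorer scorer(LeafReaderContext context) throws IOException {
        BitSet bits = producer.getBitSet(context); // per-segment bits, cached by the producer
        if (bits == null) {
          return null;
        }
        return new ConstantScoreScorer(this, score(), scoreMode,
            new BitSetIterator(bits, bits.approximateCardinality()));
      }

      @Override
      public boolean isCacheable(LeafReaderContext ctx) {
        return false; // the producer is already a per-segment cache
      }
    };
  }

  @Override
  public String toString(String field) {
    return "BitSetProducerQuery(" + producer + ")";
  }

  @Override
  public boolean equals(Object other) {
    return sameClassAs(other) && producer.equals(((BitSetProducerQuery) other).producer);
  }

  @Override
  public int hashCode() {
    return classHash() ^ producer.hashCode();
  }

  @Override
  public void visit(QueryVisitor visitor) {
    visitor.visitLeaf(this);
  }
}
{code}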

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479017#comment-17479017
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> I thought about passing in Bits but I don't think it will work because 
> searchLeaf is at the segment level so we'd need segment level Bits.

Oh, good point!

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479014#comment-17479014
 ] 

Michael Sokolov edited comment on LUCENE-10382 at 1/19/22, 10:54 PM:
-

How would this look? An easy first step is to add a filter parameter to 
KnnVectorQuery

{{  public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

then it can call {{LeafReader.searchNearestVectors}} with 
{{liveDocs.intersect(filter)}} instead of {{liveDocs}}.

[~julietibs] shared on list a link to a paper showing how the search 
degenerates for highly selective filters. The writers' approach was to fall 
back to "brute force" KNN when selectivity passes a fixed threshold. We could 
do that too, and it makes sense to me, but I guess the question is: where 
should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the 
VectorValues iterator which the KnnVectorsReader already provides. It could be 
implemented as part of KnnVectorQuery. Is there a better place?

Hmm, previously [~jpountz] had suggested this API:

{{   searchNearestNeighbors(String field, float[] target, Query filter)}}

taking a {{Query}} rather than a {{Bits}} as a filter. I guess that is more 
high-level, but in my mind a typical use case would be to use this with a 
precomputed filter (as a pre-filter) and then post filter by embedding this 
KnnVectorQuery in another BooleanQuery. And given that we have to compute a 
full bitset in advance anyway, exposing a Query interface seems a bit like 
over-promising. It's clearer to just provide Bits, I think?

Another open question is how to determine the threshold for cutting over to 
full KNN, and whether that will be user configurable at all. Ideally we can 
just pick a percent coverage and make it fixed.

Hmm, I just realized another possible concern is that the vectors themselves may 
not be dense in the documents, and that will impact the coverage of the filter 
bits. So to get an accurate coverage number, we'd in theory have to fully 
intersect the KnnVector bits (which docs have vectors in the graph) with the 
filter *and* the liveDocs, and compare the cardinality with that of the graph. 
Although maybe an approximation here is enough - intersect a subset of the bits 
to estimate the total coverage?
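
Something like this back-of-the-envelope sampling might be enough (names and 
the stride heuristic are invented; the iterator over docs-with-vectors is 
whatever the reader exposes, e.g. the {{VectorValues}} themselves):

{code:java}
// Probe roughly every stride-th doc that has a vector and count how many
// pass both the filter and liveDocs; sketch only.
static float estimateCoverage(DocIdSetIterator vectorDocs, Bits filter, Bits liveDocs, int stride)
    throws IOException {
  int probed = 0, passed = 0;
  for (int doc = vectorDocs.nextDoc();
       doc != DocIdSetIterator.NO_MORE_DOCS;
       doc = vectorDocs.advance(doc + stride)) {
    probed++;
    if ((filter == null || filter.get(doc)) && (liveDocs == null || liveDocs.get(doc))) {
      passed++;
    }
  }
  return probed == 0 ? 0f : (float) passed / probed;
}
{code}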


was (Author: sokolov):
How would this look? An easy first step is to add a filter parameter to 
KnnVectorQuery 

{{  public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

then it can call {{LeafReader.searchNearestVectors}} with 
{{liveDocs.intersect(filter)}} instead of {{liveDocs}}.

[~julietibs] shared on list a link to a paper showing how the search 
degenerates for highly selective filters. The writers' approach was to fall 
back to "brute force" KNN when selectivity passes a fixed threshold. We could 
do that too, and it makes sense to me, but I guess the question is: where 
should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the 
VectorValues iterator which the KnnVectorsReader already provides. It could be 
implemented as part of KnnVectorQuery. Is there a better place?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479014#comment-17479014
 ] 

Michael Sokolov commented on LUCENE-10382:
--

How would this look? An easy first step is to add a filter parameter to 
KnnVectorQuery 

{{  public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

then it can call {{LeafReader.searchNearestVectors}} with 
{{liveDocs.intersect(filter)}} instead of {{liveDocs}}.

[~julietibs] shared on list a link to a paper showing how the search 
degenerates for highly selective filters. The writers' approach was to fall 
back to "brute force" KNN when selectivity passes a fixed threshold. We could 
do that too, and it makes sense to me, but I guess the question is: where 
should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the 
VectorValues iterator which the KnnVectorsReader already provides. It could be 
implemented as part of KnnVectorQuery. Is there a better place?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10375) Speed up HNSW merge by writing combined vector data

2022-01-13 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475446#comment-17475446
 ] 

Michael Sokolov commented on LUCENE-10375:
--

Ooh, exciting. That code was complicated and tricky to get right too. I guess 
in hindsight it's not too surprising that it added some overhead. I will check 
out the PR; thanks for this!

> Speed up HNSW merge by writing combined vector data
> ---
>
> Key: LUCENE-10375
> URL: https://issues.apache.org/jira/browse/LUCENE-10375
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues 
> instance that gives a merged view of all the segments' VectorValues. This 
> merged instance is used when constructing the new HNSW graph. Graph building 
> needs random access, and the merged VectorValues support this by mapping from 
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
> show substantial time in Arrays.binarySearch (used to map an ordinal to a 
> segment): 
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could 
> first write all the segment vectors to a file, and use that file to build the 
> graph.
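
The per-access mapping in question looks roughly like this (illustrative names; 
{{ordBase}} holds the first merged ordinal of each segment, in sorted order):

{code:java}
// Map a merged ordinal back to (segment, segment ordinal) via binary search;
// this is the lookup the profile shows as Arrays.binarySearch.
static int segmentFor(int[] ordBase, int mergedOrd) {
  int seg = java.util.Arrays.binarySearch(ordBase, mergedOrd);
  return seg >= 0 ? seg : -seg - 2; // negative result encodes the insertion point
}

static int segmentOrd(int[] ordBase, int mergedOrd) {
  return mergedOrd - ordBase[segmentFor(ordBase, mergedOrd)];
}
{code}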



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10314) inconsistent index options when opening pre 9.0.0 index with 9.0.0

2021-12-14 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459281#comment-17459281
 ] 

Michael Sokolov commented on LUCENE-10314:
--

Small side note here; we have this comment in {{IndexOptions}}:

{code:java}
public enum IndexOptions {
  // NOTE: order is important here; FieldInfo uses this
  // order to merge two conflicting IndexOptions (always
  // "downgrades" by picking the lowest).
{code}

which is probably no longer relevant (since conflicting {{IndexOptions}} are 
forbidden).
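
For reference, the "downgrade" that comment describes amounts to something like 
this (a sketch of the old behavior; today conflicting options throw instead of 
merging):

{code:java}
// Enum declaration order doubles as severity order, so the lower ordinal wins.
static IndexOptions downgrade(IndexOptions a, IndexOptions b) {
  return a.compareTo(b) <= 0 ? a : b;
}
{code}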

> inconsistent index options when opening pre 9.0.0 index with 9.0.0
> --
>
> Key: LUCENE-10314
> URL: https://issues.apache.org/jira/browse/LUCENE-10314
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.0
>Reporter: Ian Lea
>Priority: Major
>
> We have a long-standing index with some mandatory fields and some optional
> fields that have been through multiple lucene upgrades without a full
> rebuild, and on testing out an upgrade from version 8.11.0 to 9.0.0, when
> opening an IndexWriter we are hitting the exception
> Exception in thread "main" java.lang.IllegalArgumentException: cannot
> change field "language" from index options=NONE to inconsistent index
> options=DOCS
>         at
> org.apache.lucene.index.FieldInfo.verifySameIndexOptions(FieldInfo.java:245)
>         at
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:421)
>         at
> org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(FieldInfos.java:357)
>         at
> org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1263)
>         at org.apache.lucene.index.IndexWriter.(IndexWriter.java:1116)
> Where language is one of our optional fields.
> Presumably this is at least somewhat related to "Index options can no
> longer be changed dynamically" as mentioned at
> https://lucene.apache.org/core/9_0_0/MIGRATE.html although it fails before
> our code attempts to update the index, and we are not trying to change any
> index options.
> Adding some displays to IndexWriter and FieldInfos and logging rather than
> throwing the exception I see
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=NONE
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
>  language     curr=NONE, other=DOCS
> where there is one line per segment.  It logs the exception whenever
> other=DOCS.  Subset with segment info:
> segment _x8(8.2.0):c31753/-1:[diagnostics={timestamp=1565623850605,
> lucene.version=8.2.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> java.runtime.version=11.0.3+7,
> os=Linux}]:[attributes=\{Lucene50StoredFieldsFormat.mode=BEST_SPEED}]
>  language     curr=NONE, other=NONE
> segment _y9(8.7.0):c43531/-1:[diagnostics={timestamp=1604597581562,
> lucene.version=8.7.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> java.runtime.version=11.0.3+7,
> os=Linux}]:[attributes=\{Lucene87StoredFieldsFormat.mode=BEST_SPEED}]
>  language     curr=NONE, other=DOCS
> NOT throwing java.lang.IllegalArgumentException: cannot change field
> "language" from index options=NONE to inconsistent index options=DOCS
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-03 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov updated LUCENE-10281:
-
Issue Type: Improvement  (was: Bug)

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But totalHits doesn't mean every hit actually contains SSDV, and the same 
> goes for totalDocs, so the right ratio would be *(totalHits with SSDV) / 
> (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to get 
> because we can only read the index via the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts while cardinality < 
> 1024, IntIntHashMap under the 10% threshold, and denseCounts for the rest of 
> the cases), we could still use denseCounts if cardinality < 1024; otherwise 
> start with an IntIntHashMap, and once 10% of the unique terms have been 
> collected, switch to denseCounts.
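
The proposed switch-over could look roughly like this (invented names; assumes 
the hppc {{IntIntHashMap}} that the facets module already uses):

{code:java}
import com.carrotsearch.hppc.IntIntHashMap;
import com.carrotsearch.hppc.cursors.IntIntCursor;

// Stay sparse until 10% of the unique values have been seen, then migrate
// the accumulated counts to a dense array; sketch only.
class AdaptiveCounter {
  private final int valueCount;
  private IntIntHashMap sparse = new IntIntHashMap();
  private int[] dense;

  AdaptiveCounter(int valueCount) {
    this.valueCount = valueCount;
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    sparse.addTo(ord, 1);
    if (sparse.size() > valueCount / 10) { // the 10% threshold from the issue
      dense = new int[valueCount];
      for (IntIntCursor c : sparse) {
        dense[c.key] = c.value;
      }
      sparse = null;
    }
  }
}
{code}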



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-03 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453134#comment-17453134
 ] 

Michael Sokolov commented on LUCENE-10281:
--

I don't consider this to be a bug, since it only affects a heuristic used to 
improve performance. Have you done any performance measurements with this 
change, [~ChrisLu]?

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But totalHits doesn't mean every hit actually contains SSDV, and the same 
> goes for totalDocs, so the right ratio would be *(totalHits with SSDV) / 
> (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to get 
> because we can only read the index via the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts while cardinality < 
> 1024, IntIntHashMap under the 10% threshold, and denseCounts for the rest of 
> the cases), we could still use denseCounts if cardinality < 1024; otherwise 
> start with an IntIntHashMap, and once 10% of the unique terms have been 
> collected, switch to denseCounts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10276) Turning LRUQueryCache On selectively for only a few query classes

2021-12-01 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452037#comment-17452037
 ] 

Michael Sokolov commented on LUCENE-10276:
--

The way we are using this at Amazon, scores are ignored and the query functions 
as a filter. I think QueryCachingPolicy can be used to achieve this kind of thing?
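
Something along these lines (a sketch; {{reader}} is assumed to be an open 
{{IndexReader}}):

{code:java}
// Only admit KNN queries into the LRU cache; everything else is skipped.
QueryCachingPolicy knnOnlyPolicy = new QueryCachingPolicy() {
  @Override
  public void onUse(Query query) {
    // no usage tracking needed for this simple policy
  }

  @Override
  public boolean shouldCache(Query query) throws IOException {
    return query instanceof KnnVectorQuery;
  }
};

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCachingPolicy(knnOnlyPolicy);
{code}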

> Turning LRUQueryCache On selectively for only a few query classes
> -
>
> Key: LUCENE-10276
> URL: https://issues.apache.org/jira/browse/LUCENE-10276
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Angadpreet Nagpal
>Priority: Minor
>
> Hi,
> I'm part of the Amazon Lucene team. I was trying to get KNN queries to work 
> with LRUQueryCache. From my understanding, if I have isCacheable return 
> true from my query class then it will be taken care of. However, I want Lucene 
> to turn on the cache for KNN only and not for any other query like 
> TermQuery, etc., which might also return true from the isCacheable method. 
> Is there an easy interface that would allow me to do this without having to 
> go into each individual query class and return false from isCacheable 
> for all of them, or having to write my own cache class? Any help would be 
> appreciated.
> Please correct me if I have incorrectly assumed something. This is my first 
> issue!
>  
> Thanks,
> Angadpreet Nagpal



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10258) DoubleValuesSource#fromScorer() should leverage ScoreCachingWrappingScorer

2021-11-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448718#comment-17448718
 ] 

Michael Sokolov commented on LUCENE-10258:
--

Right, this is more an issue of how the Scorable was provided – maybe this 
investigation will turn up an opportunity to add caching in another place?
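
The kind of caching being discussed, hand-rolled for illustration (not the 
actual {{ScoreCachingWrappingScorer}} code):

{code:java}
// Remember the last doc's score so repeated doubleValue() calls on the same
// doc don't recompute Scorable#score(); sketch only.
static DoubleValues cachingScores(Scorable scorer) {
  return new DoubleValues() {
    private int lastDoc = -1;
    private float lastScore;

    @Override
    public double doubleValue() throws IOException {
      if (scorer.docID() != lastDoc) {
        lastDoc = scorer.docID();
        lastScore = scorer.score();
      }
      return lastScore;
    }

    @Override
    public boolean advanceExact(int doc) {
      return true; // the scorer is already positioned on the doc being scored
    }
  };
}
{code}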

> DoubleValuesSource#fromScorer() should leverage ScoreCachingWrappingScorer
> --
>
> Key: LUCENE-10258
> URL: https://issues.apache.org/jira/browse/LUCENE-10258
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>
> DoubleValuesSource#fromScorer() is a helper method for wrapping a Scorable in 
> a DoubleValues instance (providing the value of Scorable#score() as its 
> value). It would be nice to leverage ScoreCachingWrappingScorer for this 
> use-case to avoid unnecessarily recomputing the score if the Scorable is 
> being used elsewhere.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10247) Reduce FST size by using absolute and relative coding for target pointers

2021-11-22 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447625#comment-17447625
 ] 

Michael Sokolov edited comment on LUCENE-10247 at 11/22/21, 8:52 PM:
-

As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

 

I'll also echo that speed (mainly search, but to some extent compilation speed 
too) is generally the main factor, but of course size reduction is welcome if 
it doesn't cost too much!


was (Author: sokolov):
As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

> Reduce FST size by using absolute and relative coding for target pointers
> -
>
> Key: LUCENE-10247
> URL: https://issues.apache.org/jira/browse/LUCENE-10247
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/FSTs
>Reporter: Hendrik Muhs
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTs use various tricks to reduce size. One more trick that can be added is 
> using relative coding for the target pointers that connect nodes.
> Currently, if a node has a target and it's not optimized using the `next` 
> flag, it writes a VLong which contains the address of the child. This is an 
> absolute address. A relative address can be shorter: relative addressing means 
> taking the address of the node and expressing the child node's address as the 
> diff between the parent and the child. E.g. if the child is _10099_ and the 
> parent _10100_, an absolute pointer requires 2 bytes, but the relative 
> pointer (_10100 - 10099_) requires only 1 byte. Of course that's not 
> always true; the absolute pointer could be smaller. Therefore the idea is to 
> switch between the two, depending on which is smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is 
> constructed from the leaves to the root, child node addresses are always 
> smaller than parent ones, and due to good locality, connected nodes are 
> usually stored close to each other.)
> I have a prototype for this; it will be pushed as a draft PR and added as a 
> link here. It introduces a new bit in _flags_ and changes the compiler to use 
> relative coding if possible. The lookup side needs the current node address 
> and checks one more bit flag. Both additions seem reasonably cheap and do not 
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key 
> dictionary from 92MB to 87MB, saving 5MB, which is roughly 5.5%. If you 
> have better reference datasets, let me know. It obviously works better the 
> more data there is.
> The POC implementation only supports 1 output type (positive int output) and 
> I just made it compile. For this 1 type of dictionary it's possible to create 
> and dump dictionaries correctly.
> Before spending more time on this I'd like to hear feedback and get agreement 
> on whether this is something you'd like to have. It seems to me that a size 
> reduction of FSTs is always welcome and, as pointed out, the runtime 
> additions are reasonably small. If you wonder about the implementation, I'm 
> happy to discuss the nitty-gritty details on gh; it had some interesting 
> challenges.
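
The core of the choice is small; an illustrative sketch (not the PR code):

{code:java}
// Count how many bytes a VLong encoding needs.
static int vLongBytes(long v) {
  int n = 1;
  while ((v >>>= 7) != 0) {
    n++;
  }
  return n;
}

// Children are written before parents, so parentAddr > childAddr; pick the
// relative form whenever its delta encodes in fewer bytes than the absolute address.
static boolean preferRelative(long parentAddr, long childAddr) {
  return vLongBytes(parentAddr - childAddr) < vLongBytes(childAddr);
}
{code}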



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10247) Reduce FST size by using absolute and relative coding for target pointers

2021-11-22 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447625#comment-17447625
 ] 

Michael Sokolov commented on LUCENE-10247:
--

As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

> Reduce FST size by using absolute and relative coding for target pointers
> -
>
> Key: LUCENE-10247
> URL: https://issues.apache.org/jira/browse/LUCENE-10247
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/FSTs
>Reporter: Hendrik Muhs
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTs use various tricks to reduce size. One more trick that can be added is 
> using relative coding for the target pointers that connect nodes.
> Currently, if a node has a target and it's not optimized using the `next` 
> flag, it writes a VLong which contains the address of the child. This is an 
> absolute address. A relative address can be shorter: relative addressing means 
> taking the address of the node and expressing the child node's address as the 
> diff between the parent and the child. E.g. if the child is _10099_ and the 
> parent _10100_, an absolute pointer requires 2 bytes, but the relative 
> pointer (_10100 - 10099_) requires only 1 byte. Of course that's not 
> always true; the absolute pointer could be smaller. Therefore the idea is to 
> switch between the two, depending on which is smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is 
> constructed from the leaves to the root, child node addresses are always 
> smaller than parent ones, and due to good locality, connected nodes are 
> usually stored close to each other.)
> I have a prototype for this; it will be pushed as a draft PR and added as a 
> link here. It introduces a new bit in _flags_ and changes the compiler to use 
> relative coding if possible. The lookup side needs the current node address 
> and checks one more bit flag. Both additions seem reasonably cheap and do not 
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key 
> dictionary from 92MB to 87MB, saving 5MB, which is roughly 5.5%. If you 
> have better reference datasets, let me know. It obviously works better the 
> more data there is.
> The POC implementation only supports 1 output type (positive int output) and 
> I just made it compile. For this 1 type of dictionary it's possible to create 
> and dump dictionaries correctly.
> Before spending more time on this I'd like to hear feedback and get agreement 
> on whether this is something you'd like to have. It seems to me that a size 
> reduction of FSTs is always welcome and, as pointed out, the runtime 
> additions are reasonably small. If you wonder about the implementation, I'm 
> happy to discuss the nitty-gritty details on gh; it had some interesting 
> challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets

2021-11-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446547#comment-17446547
 ] 

Michael Sokolov commented on LUCENE-10246:
--

Thanks, Greg! I think we might also want an option to report the aggregated 
weights, rather than the counts. Consider merging multiple such facet responses 
across a sharded index; you'd need to preserve the aggregated weights in order 
to properly rank the facets across the entire index.
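
Roughly the merge I have in mind (invented names; each shard reports a map from 
label to a {weight, count} pair):

{code:java}
// Rank by summed weight but carry the summed count along; sketch only.
Map<String, float[]> merged = new HashMap<>();
for (Map<String, float[]> shard : shardResults) {
  shard.forEach((label, wc) -> merged.merge(label, wc.clone(),
      (a, b) -> new float[] {a[0] + b[0], a[1] + b[1]}));
}
List<Map.Entry<String, float[]>> topN = merged.entrySet().stream()
    .sorted((a, b) -> Float.compare(b.getValue()[0], a.getValue()[0])) // by aggregated weight
    .limit(10)
    .collect(Collectors.toList());
{code}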

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10069) HNSW can miss results with very large k

2021-11-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442330#comment-17442330
 ] 

Michael Sokolov commented on LUCENE-10069:
--

> just document this behavior for now?

This makes sense to me. Thinking of how we might conceivably fix this, I think 
we would have to maintain a list of disconnected graph parts and explicitly 
connect them all? Even then I'm not sure we'd have a hard guarantee. It seems 
like a challenging problem, and the impact is not great.

> HNSW can miss results with very large k
> ---
>
> Key: LUCENE-10069
> URL: https://issues.apache.org/jira/browse/LUCENE-10069
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>
> Performing a kNN search with very large k (where k is close to the total 
> number of live docs) can return fewer than k documents. This occurs because 
> searches aren't guaranteed to be able to visit every node in the graph. 
> Specifically, when choosing entry points for the search, we make k random 
> draws _with replacement_, and sometimes end up with fewer than k entry 
> points. These entry points may not provide a connection to all other nodes in 
> the graph.
> This is an unusual case, but I think it'd be nice if we could always return k 
> docs when they are available (or at least document the behavior?) We're 
> currently working on adding graph layers (LUCENE-10054), which will change 
> how entry points are selected. Maybe we can revisit this issue once that work 
> is further along to see if a fix is still needed.
> Here's an example test failure showing the problem. We happen to select 
> {{numDocs=101}} and {{k=100}}.
> {code:java}
> ./gradlew test --tests TestKnnVectorQuery.testRandom 
> -Dtests.seed=3B6CE0E105431E18
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org


