[GitHub] [lucene] navneet1v commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-12 Thread GitBox


navneet1v commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r919668826


##
lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:
##
@@ -0,0 +1,844 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+    FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@link ShapeDocValuesField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+    super(name, FIELD_TYPE);
+    BytesRef b = computeBinaryValue(tessellation);
+    this.fieldsData = b;
+    try {
+      this.shapeComparator = new ShapeComparator(b);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {

Review Comment:
   The constructors are not public; how can clients use these doc values?
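
   For future readers: one common way a package-private constructor like this gets surfaced is a public factory in the same package. A minimal sketch, with all names below assumed for illustration rather than taken from this PR:

   ```java
   package org.apache.lucene.document; // same package as ShapeDocValuesField

   import java.util.List;

   /** Hypothetical factory; the PR may expose a different public entry point. */
   public final class ShapeDocValues {

     private ShapeDocValues() {} // static factory only

     /** Builds the doc values field from a pre-computed tessellation. */
     public static ShapeDocValuesField create(
         String fieldName, List<ShapeField.DecodedTriangle> tessellation) {
       // legal here: the package-private constructor is visible within this package
       return new ShapeDocValuesField(fieldName, tessellation);
     }
   }
   ```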






[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-12 Thread GitBox


zacharymorn commented on PR #1018:
URL: https://github.com/apache/lucene/pull/1018#issuecomment-1182774748

   Benchmark results with `wikinightly.tasks` boolean queries below:
   
   ```
   Task                            QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                 p-value
   BrowseMonthTaxoFacets             28.81 (37.2%)           26.45 (32.0%)                      -8.2% ( -56% -   97%)    0.454
   OrHighMedDayTaxoFacets            17.65  (4.5%)           16.78  (5.3%)                      -5.0% ( -14% -    5%)    0.001
   BrowseRandomLabelTaxoFacets       27.58 (50.2%)           26.72 (45.1%)                      -3.1% ( -65% -  185%)    0.836
   TermBGroup1M1P                    37.75  (7.6%)           36.62  (6.5%)                      -3.0% ( -15% -   11%)    0.179
   TermGroup100                      36.05  (5.4%)           35.18  (4.5%)                      -2.4% ( -11% -    8%)    0.130
   IntNRQ                            90.71  (4.7%)           88.69  (7.2%)                      -2.2% ( -13% -   10%)    0.248
   TermBGroup1M                      30.11  (5.3%)           29.64  (5.1%)                      -1.6% ( -11% -    9%)    0.343
   TermDateFacets                    48.93  (4.5%)           48.28  (5.0%)                      -1.3% ( -10% -    8%)    0.377
   SloppyPhrase                      13.21  (3.3%)           13.05  (3.5%)                      -1.2% (  -7% -    5%)    0.256
   IntervalsOrdered                 125.27  (7.0%)          123.79  (7.9%)                      -1.2% ( -14% -   14%)    0.615
   MedTermDayTaxoFacets              78.33  (4.2%)           77.48  (4.5%)                      -1.1% (  -9% -    8%)    0.429
   TermDayOfYearSort                254.99  (3.5%)          252.39  (2.9%)                      -1.0% (  -7% -    5%)    0.312
   AndHighMedDayTaxoFacets          122.91  (2.6%)          121.74  (2.8%)                      -1.0% (  -6% -    4%)    0.265
   SpanNear                           6.11  (5.6%)            6.05  (4.4%)                      -0.9% ( -10% -    9%)    0.583
   AndHighMed                       144.28  (4.2%)          143.04  (4.9%)                      -0.9% (  -9% -    8%)    0.556
   AndHighHigh                       43.39  (2.6%)           43.04  (4.0%)                      -0.8% (  -7% -    5%)    0.449
   Phrase                            52.64  (4.4%)           52.26  (4.6%)                      -0.7% (  -9% -    8%)    0.615
   AndHighHighDayTaxoFacets          11.91  (2.9%)           11.83  (3.6%)                      -0.7% (  -6% -    6%)    0.527
   TermDTSort                       331.47  (3.4%)          329.38  (3.3%)                      -0.6% (  -7% -    6%)    0.552
   AndHighOrMedMed                   90.33  (4.4%)           90.06  (4.8%)                      -0.3% (  -9% -    9%)    0.841
   TermGroup10K                      42.46  (4.3%)           42.38  (4.3%)                      -0.2% (  -8% -    8%)    0.886
   BrowseMonthSSDVFacets             29.10 (14.2%)           29.05  (9.5%)                      -0.2% ( -20% -   27%)    0.965
   TermGroup1M                       40.35  (4.0%)           40.30  (4.3%)                      -0.1% (  -8% -    8%)    0.932
   AndMedOrHighHigh                  86.73  (3.5%)           86.76  (3.9%)                       0.0% (  -7% -    7%)    0.978
   TermMonthSort                    273.18  (7.7%)          273.28  (8.4%)                       0.0% ( -14% -   17%)    0.989
   Fuzzy2                            81.84  (2.8%)           81.91  (2.9%)                       0.1% (  -5% -    5%)    0.918
   PKLookup                         321.81  (5.4%)          322.43  (5.8%)                       0.2% ( -10% -   12%)    0.914
   TermTitleSort                    188.55  (8.0%)          188.92  (8.3%)                       0.2% ( -14% -   17%)    0.939
   Respell                          111.20  (2.5%)          111.46  (3.7%)                       0.2% (  -5% -    6%)    0.815
   Fuzzy1                            78.31  (2.9%)           78.64  (2.9%)                       0.4% (  -5% -    6%)    0.648
   BrowseRandomLabelSSDVFacets       19.92  (8.2%)           20.03  (6.4%)                       0.5% ( -13% -   16%)    0.821
   Term                            3440.49  (3.9%)         3461.12  (4.8%)                       0.6% (  -7% -    9%)    0.664
   BrowseDayOfYearSSDVFacets         26.22 (12.5%)           26.47  (4.8%)                       0.9% ( -14% -   20%)    0.751
   BrowseDateTaxoFacets              27.49 (32.2%)           27.82 (32.6%)                       1.2% ( -48% -   97%)    0.905
   BrowseDayOfYearTaxoFacets         27.84 (31.8%)           28.20 (32.4%)                       1.3% ( -47% -   96%)    0.900
   BrowseDateSSDVFacets               3.75 (27.0%)            3.80 (28.3%)                       1.3% ( -42% -   77%)    0.879
   Wildcard                         113.02  (4.3%)          114.66  (5.3%)                       1.5% (  -7% -   11%)    0.342
   Prefix3                           83.80  (7.4%)           85.97  (7.3%)                       2.6% ( -11% -   18%)    0.266
   OrHighHigh                       113.87  (3.9%)          156.08  (8.9%)                      37.1% (  23% -   51%)    0.000
   OrHighMed                         92.87  (5.1%)          210.48 (13.0%)                     126.6% ( 103% -  152%)    0.000
   ```
   ```
   TaskQPS baseline  StdDevQPS 

[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149
 ] 

Zach Chen edited comment on LUCENE-10480 at 7/13/22 5:09 AM:
-

{quote}I wouldn't say blocker, but maybe we could give us time indeed by only 
using this new scorer on top-level disjunctions for now so that we have more 
time to figure out whether we should stick to BMW or switch to BMM for inner 
disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level 
disjunctions in *BooleanWeight* or {*}Boolean2ScorerSupplier{*}, but they 
didn't work due to the weight's / query's recursive logic. So I ended up wrapping 
the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], 
pending tests update), like your other PR. Please let me know if this approach 
looks good to you, or if there's a better approach. 


was (Author: zacharymorn):
{quote}I wouldn't say blocker, but maybe we could give us time indeed by only 
using this new scorer on top-level disjunctions for now so that we have more 
time to figure out whether we should stick to BMW or switch to BMM for inner 
disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level 
disjunctions in *BooleanWeight* or {*}Boolean2ScorerSupplier{*}, but they 
didn't work due to the weight's / query's recursive logic. So I ended up wrapping 
the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], 
pending tests update), like your other PR. Please let me know if this approach 
looks good to you, or if there's a better approach. 

 

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149
 ] 

Zach Chen commented on LUCENE-10480:


{quote}I wouldn't say blocker, but maybe we could give us time indeed by only 
using this new scorer on top-level disjunctions for now so that we have more 
time to figure out whether we should stick to BMW or switch to BMM for inner 
disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level 
disjunctions in *BooleanWeight* or {*}Boolean2ScorerSupplier{*}, but they 
didn't work due to the weight's / query's recursive logic. So I ended up wrapping 
the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], 
pending tests update), like your other PR. Please let me know if this approach 
looks good to you, or if there's a better approach. 
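
For readers following along, a minimal sketch of the wrapping pattern, in the spirit of Lucene's {{Weight.DefaultBulkScorer}} (an illustration of the approach, not the PR's actual code):

{code:java}
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.Bits;

/** Drives a doc-at-a-time Scorer (e.g. a BMM scorer) through the BulkScorer API. */
final class WrappingBulkScorer extends BulkScorer {
  private final Scorer scorer;

  WrappingBulkScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    collector.setScorer(scorer);
    DocIdSetIterator it = scorer.iterator();
    int doc = it.docID();
    if (doc < min) {
      doc = it.advance(min);
    }
    for (; doc < max; doc = it.nextDoc()) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
    }
    return doc; // next matching doc, or DocIdSetIterator.NO_MORE_DOCS
  }

  @Override
  public long cost() {
    return scorer.iterator().cost();
  }
}
{code}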

 

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[GitHub] [lucene] zacharymorn opened a new pull request, #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-12 Thread GitBox


zacharymorn opened a new pull request, #1018:
URL: https://github.com/apache/lucene/pull/1018

   ### Description (or a Jira issue link if you have one)
   
   Use BulkScorer to limit BMMScorer to only top-level disjunctions
   
   Note: Tests update pending





[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-07-12 Thread GitBox


msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1182694202

   OK, this last round of commits moves the new vector encoding parameter out 
of IndexableField and FieldInfo into the Codec constructor, keeping it internal 
to the codec, in FieldEntry. It certainly has less visible surface area now. I 
also merged from main and resolved a bunch of conflicts with the scoring 
change. I think it is correct (all the unit tests pass), but it wasn't trivial, 
and I think it would be worth running some integration/performance tests just 
to make sure all is still well.
   
   There's a little bit of code duplication in HnswGraphSearcher, where we now 
have the logic for switching from approximate to exact kNN in two places, which 
I don't like. Maybe that can be factored better? One possible shape is sketched 
below.
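
   A hedged sketch of that factoring (the `Search` interface below is an assumption for illustration, not Lucene API):

   ```java
   import java.io.IOException;
   import org.apache.lucene.search.TopDocs;
   import org.apache.lucene.search.TotalHits;

   interface Search {
     TopDocs run(int visitedLimit) throws IOException; // hypothetical
   }

   final class KnnFallback {
     /** Shared helper: try approximate kNN, fall back to exact when it gave up. */
     static TopDocs searchWithFallback(Search approximate, Search exact, int visitedLimit)
         throws IOException {
       TopDocs approx = approximate.run(visitedLimit);
       // GREATER_THAN_OR_EQUAL_TO signals the graph search stopped at the
       // visited limit, so results may be incomplete: run the exact scan instead
       if (approx.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
         return exact.run(Integer.MAX_VALUE);
       }
       return approx;
     }
   }
   ```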





[GitHub] [lucene-jira-archive] mocobeta closed issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mocobeta closed issue #38: StackOverflowException on certain issue descriptions 
and comment text
URL: https://github.com/apache/lucene-jira-archive/issues/38





[GitHub] [lucene-jira-archive] mocobeta merged pull request #39: Stack overflows can occur when parsing Jira lists

2022-07-12 Thread GitBox


mocobeta merged PR #39:
URL: https://github.com/apache/lucene-jira-archive/pull/39





[GitHub] [lucene] gsmiller merged pull request #1010: Specialize ordinal encoding for SortedSetDocValues

2022-07-12 Thread GitBox


gsmiller merged PR #1010:
URL: https://github.com/apache/lucene/pull/1010





[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-07-12 Thread Nick Knize (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Knize updated LUCENE-10654:

Fix Version/s: 9.3

> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
> {{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value 
> format. 
> This lack of doc value support for shapes means facets, aggregations, and 
> IndexOrDocValues queries are currently not possible for Shape field types. 
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations 
> and facets, the ability to compute the spatial relation with the doc value is 
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since 
> the doc value encoding is nothing more than a simple 2D integer encoding of 
> the x,y and lat,lon dimensional components. Accomplishing the same with a 
> naive integer-encoded binary representation for N-vertex shapes would be 
> costly. 
> {{ComponentTree}} already provides an efficient in-memory structure for 
> quickly computing spatial relations over Shape types, based on a binary tree 
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
> tessellation is already computed at index time. If we create an on-disk 
> representation of {{ComponentTree}}'s binary tree of tessellated triangles 
> and use this as the doc value {{binaryValue}} format, we will be able to 
> efficiently compute spatial relations with this binary representation and 
> achieve the same facet/aggregation results over shapes as we can with points 
> today (e.g., grid facets, centroid, area, etc.).






[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField

2022-07-12 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566071#comment-17566071
 ] 

Vigya Sharma commented on LUCENE-10649:
---

Great, thanks for confirming, Adrien. I'll open a PR with the fix.

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> Failing Build Link: 
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests 
> TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
> -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
> -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at 
> __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at 
> org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>  {code}






[GitHub] [lucene] nknize opened a new pull request, #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-12 Thread GitBox


nknize opened a new pull request, #1017:
URL: https://github.com/apache/lucene/pull/1017

   Adds a new doc value field to support LatLonShape and XYShape doc values. The
   implementation is inspired by ComponentTree. A binary tree of tessellated
   components (point, line, or triangle) is created. This tree is then DFS
   serialized to a variable-compressed DataOutput buffer to keep the doc value
   format as compact as possible (the idea is sketched below).
   
   DocValue queries are performed on the serialized tree using component
   relation logic similar to that found in SpatialQuery for BKD-indexed shapes.
   To make this possible, some of the relation logic is refactored to make it
   accessible to the doc value query counterpart.
   
   Current limitations (to be addressed in follow-up PRs):
   
   1. Only Polygon doc values are tested
   2. The CONTAINS relation is not yet supported
   3. Only BoundingBox queries are supported (general geometry queries will be
   added in a follow-on enhancement PR)
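
   As a rough illustration of the DFS serialization idea (the node layout and encodings below are assumptions for the sketch, not this PR's actual wire format):

   ```java
   import java.io.IOException;
   import org.apache.lucene.store.ByteBuffersDataOutput;
   import org.apache.lucene.util.BytesRef;

   final class TreeWriter {
     /** Hypothetical tree node; the real nodes hold tessellated components. */
     static final class Node {
       int minY, maxY; // encoded bounds kept by ComponentTree-style nodes
       Node left, right;
     }

     /** Pre-order (DFS) serialization into a compact, variable-length buffer. */
     static BytesRef serialize(Node root) throws IOException {
       ByteBuffersDataOutput out = ByteBuffersDataOutput.newResettableInstance();
       writeDfs(root, out);
       return new BytesRef(out.toArrayCopy());
     }

     private static void writeDfs(Node node, ByteBuffersDataOutput out) throws IOException {
       if (node == null) {
         out.writeVInt(0); // sentinel for "no child"
         return;
       }
       out.writeVInt(1);
       out.writeZInt(node.minY); // zig-zag vInt keeps the encoding compact
       out.writeZInt(node.maxY);
       writeDfs(node.left, out);
       writeDfs(node.right, out);
     }
   }
   ```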





[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-07-12 Thread GitBox


Yuti-G commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r919502708


##
lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java:
##
@@ -264,4 +264,24 @@ protected void assertFloatValuesEquals(FacetResult a, FacetResult b) {
           a.labelValues[i].value.floatValue() / 1e5);
     }
   }
+
+  protected void assertNumericValuesEquals(Number a, Number b) {
+    assertTrue(a.getClass().isInstance(b));
+    if (a instanceof Float) {
+      assertEquals(a.floatValue(), b.floatValue(), a.floatValue() / 1e5);
+    } else if (a instanceof Double) {
+      assertEquals(a.doubleValue(), b.doubleValue(), a.doubleValue() / 1e5);
+    } else {
+      assertEquals(a, b);
+    }
+  }
+
+  protected void assertAllChildrenEqualsWithoutOrdering(FacetResult a, FacetResult b) {

Review Comment:
   Thanks for the feedback! I addressed it in the new commit. Since we renamed 
the method to the generic name `assertFacetResult`, I added a comment, `// assert 
children equal with no assumption of the children ordering`, to inform future 
users in case they try to use this assert method but care about children 
ordering (e.g., getTopChildren). Please let me know what you think. Thanks!
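
   For illustration, a minimal sketch of what such an order-insensitive children check can look like (JUnit asserts assumed; not necessarily the committed code):

   ```java
   import static org.junit.Assert.assertEquals;

   import java.util.Arrays;
   import java.util.Comparator;
   import org.apache.lucene.facet.FacetResult;
   import org.apache.lucene.facet.LabelAndValue;

   final class FacetAsserts {
     /** Sorts both children arrays by label so the comparison ignores ordering. */
     static void assertSameChildrenIgnoringOrder(FacetResult a, FacetResult b) {
       LabelAndValue[] ac = a.labelValues.clone();
       LabelAndValue[] bc = b.labelValues.clone();
       Comparator<LabelAndValue> byLabel = Comparator.comparing(lv -> lv.label);
       Arrays.sort(ac, byLabel);
       Arrays.sort(bc, byLabel);
       assertEquals(ac.length, bc.length);
       for (int i = 0; i < ac.length; i++) {
         assertEquals(ac[i].label, bc[i].label);
         assertEquals(ac[i].value, bc[i].value);
       }
     }
   }
   ```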






[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-07-12 Thread Nick Knize (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Knize updated LUCENE-10654:

Description: 
{{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
{{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value 
format. 
This lack of doc value support for shapes means facets, aggregations, and 
IndexOrDocValues queries are currently not possible for Shape field types. This 
gap needs to be closed in Lucene.

To support IndexOrDocValues queries along with various geometry aggregations 
and facets, the ability to compute the spatial relation with the doc value is 
needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the 
doc value encoding is nothing more than a simple 2D integer encoding of the x,y 
and lat,lon dimensional components. Accomplishing the same with a naive 
integer-encoded binary representation for N-vertex shapes would be costly. 

{{ComponentTree}} already provides an efficient in-memory structure for quickly 
computing spatial relations over Shape types, based on a binary tree of 
tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
tessellation is already computed at index time. If we create an on-disk 
representation of {{ComponentTree}}'s binary tree of tessellated triangles and 
use this as the doc value {{binaryValue}} format, we will be able to efficiently 
compute spatial relations with this binary representation and achieve the same 
facet/aggregation results over shapes as we can with points today (e.g., grid 
facets, centroid, area, etc.).

  was:
{{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
{{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value 
format. 
This lack of doc value support for shapes means facets, aggregations, and 
IndexOrDocValues queries are currently not possible for Shape field types. This 
gap needs to be closed in Lucene.

To support IndexOrDocValues queries along with various geometry aggregations 
and facets, the ability to compute the spatial relation with the doc value is 
needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the 
doc value encoding is nothing more than a simple 2D integer encoding of the x,y 
and lat,lon dimensional components. Accomplishing the same with a naive 
integer-encoded binary representation for N-vertex shapes would be costly. 

{{ComponentTree}} already provides an efficient in-memory structure for quickly 
computing spatial relations over Shape types, based on a binary tree of 
tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
tessellation is already computed at index time. If we create an on-disk 
representation of {{ComponentTree}}s binary tree of tessellated triangles and 
use this as the doc value {{binaryValue}} format, we will be able to efficiently 
compute spatial relations with this binary representation and achieve the same 
facet/aggregation results over shapes as we can with points today (e.g., grid 
facets, centroid, area, etc.).


> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
> {{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value 
> format. 
> This lack of doc value support for shapes means facets, aggregations, and 
> IndexOrDocValues queries are currently not possible for Shape field types. 
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations 
> and facets, the ability to compute the spatial relation with the doc value is 
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since 
> the doc value encoding is nothing more than a simple 2D integer encoding of 
> the x,y and lat,lon dimensional components. Accomplishing the same with a 
> naive integer-encoded binary representation for N-vertex shapes would be 
> costly. 
> {{ComponentTree}} already provides an efficient in-memory structure for 
> quickly computing spatial relations over Shape types, based on a binary tree 
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
> tessellation is already computed at index time. If we create an on-disk 
> representation of {{ComponentTree}}'s binary tree of tessellated triangles 
> and use this as the doc 

[jira] [Created] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-07-12 Thread Nick Knize (Jira)
Nick Knize created LUCENE-10654:
---

 Summary: New companion doc value format for LatLonShape and 
XYShape field types
 Key: LUCENE-10654
 URL: https://issues.apache.org/jira/browse/LUCENE-10654
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Nick Knize


{{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
{{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value 
format. 
This lack of doc value support for shapes means facets, aggregations, and 
IndexOrDocValues queries are currently not possible for Shape field types. This 
gap needs to be closed in Lucene.

To support IndexOrDocValues queries along with various geometry aggregations 
and facets, the ability to compute the spatial relation with the doc value is 
needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the 
doc value encoding is nothing more than a simple 2D integer encoding of the x,y 
and lat,lon dimensional components. Accomplishing the same with a naive 
integer-encoded binary representation for N-vertex shapes would be costly. 

{{ComponentTree}} already provides an efficient in-memory structure for quickly 
computing spatial relations over Shape types, based on a binary tree of 
tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
tessellation is already computed at index time. If we create an on-disk 
representation of {{ComponentTree}}s binary tree of tessellated triangles and 
use this as the doc value {{binaryValue}} format, we will be able to efficiently 
compute spatial relations with this binary representation and achieve the same 
facet/aggregation results over shapes as we can with points today (e.g., grid 
facets, centroid, area, etc.).






[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-07-12 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566026#comment-17566026
 ] 

Mayya Sharipova commented on LUCENE-10471:
--

[~sstolpovskiy] [~sokolov] Thanks for providing your suggestions. It looks 
like we clearly see the need for up to 2048 dims for images, so I will be 
merging the linked PR. 

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple of well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.






[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566015#comment-17566015
 ] 

Michael Sokolov commented on LUCENE-10577:
--

OK, that makes sense to me – I'll see about moving the setting to the 
`Lucene93HnswVectorsFormat`

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
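
To make the scale-and-quantize idea above concrete, a minimal sketch (the per-vector scale and the constant 127 are illustrative; the description discusses choosing the scale per data set or per segment instead):

{code:java}
/** Maps floats into signed bytes using scale = max(abs(min), abs(max)). */
static byte[] quantize(float[] v) {
  float scale = 0f;
  for (float x : v) {
    scale = Math.max(scale, Math.abs(x));
  }
  byte[] out = new byte[v.length];
  if (scale == 0f) {
    return out; // all-zero vector
  }
  for (int i = 0; i < v.length; i++) {
    out[i] = (byte) Math.round(v[i] / scale * 127f); // values land in [-127, 127]
  }
  return out;
}
{code}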






[GitHub] [lucene] jpountz commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-12 Thread GitBox


jpountz commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r918752313


##
lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:
##
@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
     }
 
     @Override
-    public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
+    public void compress(ByteBuffersDataInput buffersInput, int off, int len, DataOutput out)

Review Comment:
   Should we remove `off` and `len` and rely on callers to create a 
`ByteBuffersDataInput#slice` if they only need to compress a subset of the 
input?
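
   A caller-side sketch of that suggestion (the narrowed `compress(ByteBuffersDataInput, DataOutput)` signature is an assumption here):

   ```java
   import java.io.IOException;
   import org.apache.lucene.codecs.compressing.Compressor;
   import org.apache.lucene.store.ByteBuffersDataInput;
   import org.apache.lucene.store.DataOutput;

   final class CompressHelper {
     /** The caller slices; the compressor always consumes its whole input. */
     static void compressRange(
         Compressor compressor, ByteBuffersDataInput in, long off, long len, DataOutput out)
         throws IOException {
       // ByteBuffersDataInput#slice returns a view over [off, off + len)
       compressor.compress(in.slice(off, len), out); // hypothetical narrowed signature
     }
   }
   ```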






[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181660019

   Sorry -- not pushed to the PR yet -- struggling w/ git ;)





[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565873#comment-17565873
 ] 

ASF subversion and git services commented on LUCENE-10619:
--

Commit 9f9786122b487f992119f45c5d8a51a8d9d4a6f8 in lucene's branch 
refs/heads/branch_9x from tang donghai
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9f9786122b4 ]

LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966)



> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because we don't know the length of the slice, writeBytes will always write 
> bytes one at a time instead of writing a block of bytes.
> Maybe we could return both offset and length from ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so the offset fits in at most 15 bits.
> 2. A slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into an int: offset | length.
> There are only two places where this function is used, so the cost of 
> changing it is relatively small.
> If allocSlice could return the offset and length of the new slice, we could 
> change writeBytes like below:
> {code:java}
> // write block of bytes each time
> while(remaining > 0 ) {
>int offsetAndLength = allocSlice(bytes, offset);
>length = min(remaining, (offsetAndLength & 0xff) - 1);
>offset = offsetAndLength >> 8;
>System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>remaining -= length;
>offset+= (length + 1);
> }
> {code}
> If this works, I'd like to raise a PR.
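
To make the proposed packing concrete, a small sketch of the pack/unpack helpers (names assumed):

{code:java}
// BYTE_BLOCK_SIZE is 32768, so an offset fits in 15 bits; a slice length
// of at most 200 fits in 8 bits, leaving the int's sign bit untouched.
static int packOffsetAndLength(int offset, int length) {
  assert offset < (1 << 15) && length < (1 << 8);
  return (offset << 8) | length;
}

static int unpackOffset(int packed) {
  return packed >>> 8;
}

static int unpackLength(int packed) {
  return packed & 0xFF;
}
{code}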






[jira] [Resolved] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10619.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because we don't know the length of the slice, writeBytes will always write 
> bytes one at a time instead of writing a block of bytes.
> Maybe we could return both offset and length from ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so the offset fits in at most 15 bits.
> 2. A slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into an int: offset | length.
> There are only two places where this function is used, so the cost of 
> changing it is relatively small.
> If allocSlice could return the offset and length of the new slice, we could 
> change writeBytes like below:
> {code:java}
> // write block of bytes each time
> while(remaining > 0 ) {
>int offsetAndLength = allocSlice(bytes, offset);
>length = min(remaining, (offsetAndLength & 0xff) - 1);
>offset = offsetAndLength >> 8;
>System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>remaining -= length;
>offset+= (length + 1);
> }
> {code}
> If this works, I'd like to raise a PR.






[GitHub] [lucene-jira-archive] mocobeta commented on pull request #39: Stack overflows can occur when parsing Jira lists

2022-07-12 Thread GitBox


mocobeta commented on PR #39:
URL: 
https://github.com/apache/lucene-jira-archive/pull/39#issuecomment-1181804695

   Thank you @mikemccand 





[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181662032

   OK don't merge this -- I somehow messed up and slurped in unrelated (already 
previously committed/pushed) changes.  I have to drop off for now but will try 
to fix this a bit later ;)





[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mocobeta commented on issue #38:
URL: 
https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181803770

   I'll merge it once I've confirmed it parses all Jira issues without any 
errors. (I think nobody can review the quick and dirty fix...)





[GitHub] [lucene] tang-hi commented on pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread GitBox


tang-hi commented on PR #966:
URL: https://github.com/apache/lucene/pull/966#issuecomment-1181886902

   @jpountz thanks for the suggestion. I have changed testWriteBytes to 
write small chunks each time.





[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #39: Stack overflows can occur when parsing Jira lists

2022-07-12 Thread GitBox


mikemccand commented on code in PR #39:
URL: https://github.com/apache/lucene-jira-archive/pull/39#discussion_r919015037


##
migration/src/markup/lists.py:
##
@@ -40,6 +40,11 @@ def action(self, tokens: ParseResults) -> str:
 
         for line in tokens:
             # print(repr(line))
+            if line == "\n":
+                # can't really explain but if this is the first item, an empty
+                # string should be added to preserve line feed

Review Comment:
   LOL






[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181657754

   I pushed a small change to make a best-effort when we hit exceptions from 
the converter.  Such comments look like this: 
https://github.com/mikemccand/stargazers-migration-test/issues/52#issuecomment-1181652126
   
   But hopefully this new code never runs w/ @mocobeta's better fix for the 
infinite / slow recursion.





[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565872#comment-17565872
 ] 

ASF subversion and git services commented on LUCENE-10619:
--

Commit d7c2def019b8c1318d3c37a7065569e8d1a1af1f in lucene's branch 
refs/heads/main from tang donghai
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d7c2def019b ]

LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966)



> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because we don't know the length of the slice, writeBytes will always write 
> bytes one at a time instead of writing a block of bytes.
> Maybe we could return both offset and length from ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so the offset fits in at most 15 bits.
> 2. A slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into an int: offset | length.
> There are only two places where this function is used, so the cost of 
> changing it is relatively small.
> If allocSlice could return the offset and length of the new slice, we could 
> change writeBytes like below:
> {code:java}
> // write block of bytes each time
> while(remaining > 0 ) {
>int offsetAndLength = allocSlice(bytes, offset);
>length = min(remaining, (offsetAndLength & 0xff) - 1);
>offset = offsetAndLength >> 8;
>System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>remaining -= length;
>offset+= (length + 1);
> }
> {code}
> If this works, I'd like to raise a PR.






[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mikemccand opened a new issue, #38:
URL: https://github.com/apache/lucene-jira-archive/issues/38

   Spinoff from #33.
   
   Some issues' text hit a stack overflow exception, e.g. one of the comments 
on LUCENE-550:
   
   ```
   (.venv) beast3:migration[polish_legacy_jira]$ python src/jira2github_import.py --min 1 --max 10649
   [2022-07-11 15:01:02,826] INFO:jira2github_import: Converting Jira issues to GitHub issues in /l/jira-github-migration/migration/github-import-data
   [2022-07-11 15:10:25,306] WARNING:jira2github_import: Jira dump file not found: /l/jira-github-migration/migration/jira-dump/LUCENE-498.json

   ERROR: unhandled exception while converting LUCENE-550

   Traceback (most recent call last):
     File "/l/jira-github-migration/migration/src/jira2github_import.py", line 229, in <module>
       convert_issue(num, dump_dir, output_dir, account_map, github_att_repo, github_att_branch)
     File "/l/jira-github-migration/migration/src/jira2github_import.py", line 133, in convert_issue
       comment_body = f"""{convert_text(comment_body, att_replace_map, account_map)}
     File "/l/jira-github-migration/migration/src/jira_util.py", line 216, in convert_text
       text = jira2markdown.convert(text, elements=elements)
     File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/jira2markdown/parser.py", line 20, in convert
       return markup.transformString(text)
     File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 2059, in transformString
       for t, s, e in self.scanString(instring):
     File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 2007, in scanString
       nextLoc, tokens = parseFn(instring, preloc, callPreParse=False)
     File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 1683, in _parseNoCache
       loc, tokens = self.parseImpl(instring, preloc, doActions)
     File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 4462, in parseImpl
       return self.expr._parse(instring, loc, doActions, callPreParse=False)

[GitHub] [lucene-jira-archive] mikemccand commented on issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mikemccand commented on issue #38:
URL: 
https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181596940

   Note that it is pretty rare -- when I ran the full conversion, I saw four 
separate occurrences.  Might not be so important to track down?  We can just 
carry over the raw text, escaped in MD code block, in such cases.





[GitHub] [lucene] jpountz merged pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread GitBox


jpountz merged PR #966:
URL: https://github.com/apache/lucene/pull/966





[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565871#comment-17565871
 ] 

Julie Tibshirani commented on LUCENE-10577:
---

I checked out the latest PR changes, and I like the direction of using a new 
VectorEncoding class rather than squeezing this into VectorSimilarityFunction. 
I wonder if VectorEncoding should be a parameter on Lucene93HnswVectorsFormat 
though (alongside maxConn/ beamWidth) rather than on FieldInfo. My reasoning: 
FieldInfo contains information relevant to API consumers... but we want the 
encoding to be an internal detail of the format. Moreover, adding it to 
FieldInfo implies that all other vectors format implementations should support 
it. This could place more burden on implementing new formats. (I think this was 
part of the objection to me adding cosine similarity, that it increases the 
codec surface area without enough benefit? We discussed this in LUCENE-10191.)

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
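
For illustration, a minimal sketch (not the actual patch) of the linear 
scaling + quantization described above, with a dot product computed directly 
on byte values as in drawback 2:

{code:java}
// Sketch only: scale by 127 / max(|v|) -- i.e. max(abs(min-value), abs(max-value))
// over the vector -- and round each component into a signed byte (4:1 compression).
static byte[] quantize(float[] vector) {
  float maxAbs = 0;
  for (float v : vector) {
    maxAbs = Math.max(maxAbs, Math.abs(v));
  }
  float scale = maxAbs == 0 ? 1 : 127f / maxAbs;
  byte[] quantized = new byte[vector.length];
  for (int i = 0; i < vector.length; i++) {
    quantized[i] = (byte) Math.round(vector[i] * scale);
  }
  return quantized;
}

// Dot product computed directly on byte values, avoiding the byte/float
// conversion mentioned in drawback 2. byte * byte promotes to int, so the
// per-component products cannot overflow.
static int dotProduct(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
{code}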



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565919#comment-17565919
 ] 

Julie Tibshirani edited comment on LUCENE-10577 at 7/12/22 4:23 PM:


I wasn't suggesting making it entirely an internal detail; I just suggested 
moving the VectorEncoding configuration from FieldInfo (where it currently is 
in your PR) to the Lucene93HnswVectorsFormat constructor. It would still be 
user-configurable and have a good default, just like maxConn and beamWidth. I 
think I agree it would be complicated (and maybe unclear for users) if we tried 
to do it under the hood with no user config at all.

Edit: and yes, exactly -- PerFieldKnnVectorsFormat is what allows you to have 
different format configuration parameters for different vector fields. You can 
use it, for example, to set maxConn=16 for one field, and maxConn=32 for some 
other field.
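
For illustration, a minimal sketch of that per-field wiring (the field name 
and parameter values here are made up; the format shown is the released 9.2 
one):

{code:java}
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Different HNSW parameters for different vector fields.
public class PerFieldHnswConfig extends PerFieldKnnVectorsFormat {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    if ("title_vector".equals(field)) {
      return new Lucene92HnswVectorsFormat(16, 100); // maxConn=16
    }
    return new Lucene92HnswVectorsFormat(32, 100); // maxConn=32 elsewhere
  }
}
{code}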


was (Author: julietibs):
I wasn't suggesting making it entirely an internal detail; I just suggested 
moving the VectorEncoding configuration from FieldInfo (where it currently is 
in your PR) to the Lucene93HnswVectorsFormat constructor. It would still be 
user-configurable and have a good default, just like maxConn and beamWidth. I 
think I agree it would be complicated (and maybe unclear for users) if we tried 
to do it under the hood with no user config at all.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565919#comment-17565919
 ] 

Julie Tibshirani commented on LUCENE-10577:
---

I wasn't suggesting making it entirely an internal detail; I just suggested 
moving the VectorEncoding configuration from FieldInfo (where it currently is 
in your PR) to the Lucene93HnswVectorsFormat constructor. It would still be 
user-configurable and have a good default, just like maxConn and beamWidth. I 
think I agree it would be complicated (and maybe unclear for users) if we tried 
to do it under the hood with no user config at all.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures

2022-07-12 Thread Marc D'Mello (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565944#comment-17565944
 ] 

Marc D'Mello commented on LUCENE-10628:
---

Thanks for taking a look! As for the answer to your question: I'm not sure, 
it's really up to the user. The facet set matchers can theoretically have an 
unlimited number of dimensions. Just to make sure we are on the same page, I'll 
define the relevant parts of the {{facetset}} package API (apologies if I'm 
just repeating information you already know here):

Essentially we store multi-dim points into a {{BinaryDocValues}} field, so for 
example a list like {{(1, 2, 3), (2, 3, 4), (3, 4, 5)...}}. We have an 
{{ExactFacetSetMatcher}} that represents a single point of the same dimension 
as the field, and we count how many points in the BDV match the point 
represented by that {{ExactFacetSetMatcher}}. We can put multiple of these 
{{ExactFacetSetMatcher}}'s into a group in {{MatchingFacetSetCounts}} and count 
how many points in the BDV matched each {{ExactFacetSetMatcher}}. Currently, we 
linearly scan each point through each {{ExactFacetSetMatcher}} to get the 
counts, which is the part I want to optimize by putting the 
{{ExactFacetSetMatcher}}'s into a space partitioning data structure (either a 
KD tree, or, as you suggested, an interval tree). We also have 
{{RangeFacetSetMatcher}}, which is similar to {{ExactFacetSetMatcher}} except 
that you can define ranges per dimension, for example something like 
{{(1 - 3, 3 - 4, 4 - 6)}}, which would match if all of a point's dimensions lie 
within the ranges. I imagine you could put a group of these 
{{RangeFacetSetMatcher}}'s into an R tree to avoid linear scanning.

So I'd imagine most use cases would have low dimensionality, but there might be 
some use cases that require higher dimensions. In the higher-dimension case, 
would it just be best to resort to linear scanning rather than building a KD 
tree? For the {{RangeFacetSetMatcher}} case, would bulk-adding these into an R 
tree also be too complex? In the common use case there would be many more 
points in the index than {{FacetSetMatcher}}'s, so another approach could be to 
index the points in a KD tree. Sorry for all the questions! But I would be 
really interested in any suggestions you have here, as I am inexperienced with 
these kinds of data structures. 
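
For readers following along, a conceptual sketch (plain Java, not the 
{{facetset}} API) of the per-matcher check that the linear scan evaluates once 
per (point, matcher) pair:

{code:java}
// Conceptual only: one range per dimension; a point matches when every
// dimension falls inside its range. Exact matching is the special case
// where min == max in each dimension.
static final class Range {
  final long min, max;

  Range(long min, long max) {
    this.min = min;
    this.max = max;
  }

  boolean contains(long v) {
    return v >= min && v <= max;
  }
}

static boolean matches(long[] point, Range[] rangePerDimension) {
  for (int dim = 0; dim < point.length; dim++) {
    if (!rangePerDimension[dim].contains(point[dim])) {
      return false;
    }
  }
  return true;
}
{code}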

> Enable MatchingFacetSetCounts to use space partitioning data structures
> ---
>
> Key: LUCENE-10628
> URL: https://issues.apache.org/jira/browse/LUCENE-10628
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Marc D'Mello
>Priority: Minor
>
> Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} 
> instances passed into it linearly. While this is fine in some cases, if we 
> have a large amount of {{FacetSetMatcher}}'s, this can be inefficient. We 
> should provide the option to users to enable the use of space partitioning 
> data structures (namely R trees and KD trees) so we can potentially scan over 
> these {{FacetSetMatcher}}'s in sub-linear time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565384#comment-17565384
 ] 

Adrien Grand commented on LUCENE-10650:
---

{{query.boost}} is the {{query.getBoost()}} from DFRSimilarity's {{double 
score(BasicStats stats, double freq, double docLen)}}, which does 
{{stats.getBoost() * basicModel.score(stats, tfn, aeTimes1pTfn)}}.

The division by log(2) is not the tfn; it is a way to turn {{Math.log}}, which 
is a natural logarithm, into a log in base 2.

I wouldn't expect latency to be higher; this should get compiled to more or 
less the same code that you used to rely on in DFRSimilarity.
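
Concretely, the conversion is the standard change-of-base identity; a one-line 
sketch:

{code:java}
// Math.log is the natural logarithm, so dividing by Math.log(2) yields
// a base-2 logarithm: log2(x) = ln(x) / ln(2).
static double log2(double x) {
  return Math.log(x) / Math.log(2);
}
{code}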

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is 
> different than what we are used to. (We depend heavily on the exact scoring).
> Do you have any advice how we can keep the same scoring as before?
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-12 Thread GitBox


jpountz commented on PR #987:
URL: https://github.com/apache/lucene/pull/987#issuecomment-1181718918

   > if we only use the compress method with the ByteBuffersDataInput variant 
in LUCENE90, we cannot use the abstract method Compressor.compress when we want 
to use another compression mode.
   
   I think that this downside is fine? We prefer codecs to evolve independently, 
so when we start needing changes for a new codec, we fork the code so that old 
codecs still rely on the unchanged code (which should move to 
lucene/backward-codecs) while the new codecs only use the new code, without 
carrying over legacy code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #37: Why are some Jira issues completely missing?

2022-07-12 Thread GitBox


mikemccand opened a new issue, #37:
URL: https://github.com/apache/lucene-jira-archive/issues/37

   Spinoff from #33.
   
   This is not a blocker for migration; it's more that I'm curious how Jira lost 
issues and how pervasive this problem might be -- maybe other Apache projects 
are affected? Or maybe we are doing something wrong in Lucene ;)
   
   Some Jira issues in the sequential numbering from 1 .. N just don't seem to 
exist, and seem to have never existed (Google searching, jirasearch, Jira's own 
(Lucene-based!) search engine, and my long email archive all fail to find at 
least one of them):
   
   ```
   [2022-07-11 07:57:25,815] WARNING:download_jira: Can't download LUCENE-498. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 07:59:10,096] WARNING:download_jira: Can't download LUCENE-613. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 07:59:10,978] WARNING:download_jira: Can't download LUCENE-614. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 07:59:13,615] WARNING:download_jira: Can't download LUCENE-617. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 08:10:36,059] WARNING:download_jira: Can't download LUCENE-1362. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 08:10:36,932] WARNING:download_jira: Can't download LUCENE-1363. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 08:10:37,798] WARNING:download_jira: Can't download LUCENE-1364. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 08:26:22,112] WARNING:download_jira: Can't download LUCENE-2375. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   [2022-07-11 08:27:02,304] WARNING:download_jira: Can't download LUCENE-2418. 
status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
   ```
   
   Maybe Jira has some concurrency bug in how it numbers issues and sometimes 
leaves holes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181586767

   And thank you for the quick fix!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181589626

   > It looks like a bug introduced in 
[cfbc821](https://github.com/apache/lucene-jira-archive/commit/cfbc821390859a7053e43028325b6bc616ec2b5b).
 (I have postponed testing it with the whole Jira dump.)
   > I'll take a look at it.
   
   Thanks for chasing this down -- I'll open a spinoff issue to track progress.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on issue #36: Can we parallelize the converter script?

2022-07-12 Thread GitBox


mocobeta commented on issue #36:
URL: 
https://github.com/apache/lucene-jira-archive/issues/36#issuecomment-1181522090

   
https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181586644

   > Sorry, there should have been a "catch all" try/except clause. I made a 
quick fix in #35.
   
   No worries at all!  No need to apologize!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565914#comment-17565914
 ] 

Michael Sokolov commented on LUCENE-10577:
--

It would be nice if we could make this encoding an *entirely* internal detail, 
with no user configuration, but I don't think we can, because:
 # the choice of quantization scaling factor has a significant impact on the 
lossiness and thus recall, and it needs to be tuned for each dataset;
 # even if we were able to do this tuning in Lucene automatically, we would 
have to do it per-segment, and then when we merge we'd have to re-scale, losing 
more precision.

Because of this, I think we need to expose the ability for users to provide 
quantized data, and then they need some way of specifying, for a given field, 
whether it is byte-encoded or float-encoded. Although I do see that it could be 
done using PerFieldKnnVectorsFormat -- is that what you were saying, 
[~julietibs]?

 

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181587514

   > I'm also converting the whole Jira issue myself; it looks like it takes 
several hours... (recent changes to fix conversion errors could affect the 
conversion speed I think). This shouldn't be so slow, raised #36.
   
   Thanks -- I was beginning to wonder if it was normal how long it was taking 
;)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand merged pull request #40: #27: polish the legacy Jira text added to the issue a bit

2022-07-12 Thread GitBox


mikemccand merged PR #40:
URL: https://github.com/apache/lucene-jira-archive/pull/40


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-07-12 Thread GitBox


jpountz commented on PR #907:
URL: https://github.com/apache/lucene/pull/907#issuecomment-1181518177

   @shahrs87 Can you look into removing all other instances of `terms == 
Terms.EMPTY` or `terms != Terms.EMPTY` as well? To do this while keeping tests 
passing, I think you'll need to create empty `Terms` instances that still honor 
the options of the `FieldInfo` as per my previous suggestion. E.g. you could 
add a new `Terms#empty(FieldInfo)` helper method that does the right thing for 
`hasFreqs()`, `hasPositions()`, etc. by looking at the index options of the 
`FieldInfo`.
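
   A rough sketch of what such a helper might look like (`Terms#empty(FieldInfo)` 
does not exist yet; the index-options comparisons below are one plausible way 
to honor the `FieldInfo`):
   
   ```java
   // Sketch only: an empty Terms whose has*() flags still reflect the
   // field's index options, so consumers see consistent metadata.
   static Terms empty(FieldInfo fieldInfo) {
     IndexOptions options = fieldInfo.getIndexOptions();
     return new Terms() {
       @Override
       public TermsEnum iterator() {
         return TermsEnum.EMPTY;
       }
   
       @Override
       public long size() {
         return 0;
       }
   
       @Override
       public long getSumTotalTermFreq() {
         return 0;
       }
   
       @Override
       public long getSumDocFreq() {
         return 0;
       }
   
       @Override
       public int getDocCount() {
         return 0;
       }
   
       @Override
       public boolean hasFreqs() {
         return options.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
       }
   
       @Override
       public boolean hasOffsets() {
         return options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
       }
   
       @Override
       public boolean hasPositions() {
         return options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
       }
   
       @Override
       public boolean hasPayloads() {
         return false;
       }
     };
   }
   ```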


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mikemccand commented on issue #38:
URL: 
https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181644356

   > I'm trying to find other ways that do not cause infinite recursion while 
parsing lists correctly.
   
   Awesome, thanks @mocobeta!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919332844


##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter {
 }
 
 @Override
-public void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws 
IOException {
+  KnnVectorsWriter writer = getInstance(fieldInfo);
+  return writer.addField(fieldInfo);
+}
+
+@Override
+public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
+  for (WriterAndSuffix was : formats.values()) {
+was.writer.flush(maxDoc, sortMap);
+  }
+}
+
+@Override
+public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
 throws IOException {
-  getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader);
+  getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader);

Review Comment:
   @jtibshirani  Thanks for the suggestion, but we can't at the same time make 
`KnnVectorsWriter#merge` final and also un-support `mergeOneField` in 
`PerFieldKnnVectorsFormat`, because `KnnVectorsWriter#merge` calls the 
corresponding `mergeOneField`. Should we keep `mergeOneField` and make 
`KnnVectorsWriter#merge` final, as in your other suggestion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1182388563

   @jtibshirani @jpountz  Thanks for your review. I've tried to address your 
comments, but it looks like we are still not clear on how to organize the 
`merge` and `flush` methods. It would be nice if you could provide further 
comments. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565380#comment-17565380
 ] 

Adrien Grand commented on LUCENE-10653:
---

+1 to doing a bulk heapify

The fact that this scorer only handles 2 clauses for now is only a way to give 
us more time to evaluate when we should use it vs. WANDScorer in my opinion. 
Most likely it will be used for more than 2 clauses at some point in the future.
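
For reference, a generic sketch of the bulk-heapify idea (plain Java over an 
array of costs, not BMMScorer's actual heap): build the heap in one O(n) 
bottom-up pass instead of n successive O(log n) adds:

{code:java}
// Bottom-up heapify: sift down every internal node, starting from the last
// one. Total work is O(n), vs O(n log n) for clearing and re-adding.
static void heapify(long[] heap) {
  for (int i = heap.length / 2 - 1; i >= 0; i--) {
    siftDown(heap, i);
  }
}

static void siftDown(long[] heap, int i) {
  int n = heap.length;
  while (true) {
    int left = 2 * i + 1;
    int right = left + 1;
    int smallest = i;
    if (left < n && heap[left] < heap[smallest]) {
      smallest = left;
    }
    if (right < n && heap[right] < heap[smallest]) {
      smallest = right;
    }
    if (smallest == i) {
      return; // heap property holds below i
    }
    long tmp = heap[i];
    heap[i] = heap[smallest];
    heap[smallest] = tmp;
    i = smallest;
  }
}
{code}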

> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> ---
>
> Key: LUCENE-10653
> URL: https://issues.apache.org/jira/browse/LUCENE-10653
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing and 
> then iteratively calling {{{}add{}}}. It would be more efficient to heapify 
> in bulk. This is more academic than anything right now though since BMMScorer 
> is only used with two-clause disjunctions, so it's sort of a silly 
> optimization if it's not supporting a greater number of clauses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919349095


##
lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
##
@@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;
 
 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment 
flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment 
flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vector writers before v9.3.
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-this.fieldInfo = fieldInfo;
-this.iwBytesUsed = iwBytesUsed;
-this.docsWithField = new DocsWithFieldSet();
-this.bytesUsed = docsWithField.ramBytesUsed();
-if (iwBytesUsed != null) {
-  iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
   @jtibshirani  Thanks for the suggestion; I understand it and indeed think it 
is a good idea. I will hold off on implementing it until we finalize how we 
want to organize the `KnnVectorsWriter` and `KnnFieldVectorsWriter` classes, so 
we can implement `SimpleTextKnnVectorsWriter` in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mocobeta commented on issue #38:
URL: 
https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181776008

   I opened #39. I cannot really explain _why the ad-hoc fix works_, but it 
works. I think there should be a better way, though; this should be sufficient 
for the one-time batch.
   - it parses Jira list syntax correctly (if the list is not a complex one)
   - it does not cause stack overflows, and it improves the throughput (30~40% 
faster than current main on my desktop)
   
   Still, this takes four or five hours for me; we could parallelize it (#36) 
so that we can improve/test it more often.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919343914


##
lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java:
##
@@ -266,65 +470,128 @@ private void writeMeta(
 }
   }
 
-  private OnHeapHnswGraph writeGraph(
-  RandomAccessVectorValuesProducer vectorValues, VectorSimilarityFunction 
similarityFunction)
+  /**
+   * Writes the vector values to the output and returns a set of documents 
that contains vectors.
+   */
+  private static DocsWithFieldSet writeVectorData(IndexOutput output, 
VectorValues vectors)
   throws IOException {
+DocsWithFieldSet docsWithField = new DocsWithFieldSet();
+for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = 
vectors.nextDoc()) {
+  // write vector
+  BytesRef binaryValue = vectors.binaryValue();
+  assert binaryValue.length == vectors.dimension() * Float.BYTES;
+  output.writeBytes(binaryValue.bytes, binaryValue.offset, 
binaryValue.length);
+  docsWithField.add(docV);
+}
+return docsWithField;
+  }
 
-// build graph
-HnswGraphBuilder hnswGraphBuilder =
-new HnswGraphBuilder(
-vectorValues, similarityFunction, M, beamWidth, 
HnswGraphBuilder.randSeed);
-hnswGraphBuilder.setInfoStream(segmentWriteState.infoStream);
-OnHeapHnswGraph graph = 
hnswGraphBuilder.build(vectorValues.randomAccess());
+  @Override
+  public void close() throws IOException {
+IOUtils.close(meta, vectorData, vectorIndex);
+  }
 
-// write vectors' neighbours on each level into the vectorIndex file
-int countOnLevel0 = graph.size();
-for (int level = 0; level < graph.numLevels(); level++) {
-  int maxConnOnLevel = level == 0 ? (M * 2) : M;
-  NodesIterator nodesOnLevel = graph.getNodesOnLevel(level);
-  while (nodesOnLevel.hasNext()) {
-int node = nodesOnLevel.nextInt();
-NeighborArray neighbors = graph.getNeighbors(level, node);
-int size = neighbors.size();
-vectorIndex.writeInt(size);
-// Destructively modify; it's ok we are discarding it after this
-int[] nnodes = neighbors.node();
-Arrays.sort(nnodes, 0, size);
-for (int i = 0; i < size; i++) {
-  int nnode = nnodes[i];
-  assert nnode < countOnLevel0 : "node too large: " + nnode + ">=" + 
countOnLevel0;
-  vectorIndex.writeInt(nnode);
-}
-// if number of connections < maxConn, add bogus values up to maxConn 
to have predictable
-// offsets
-for (int i = size; i < maxConnOnLevel; i++) {
-  vectorIndex.writeInt(0);
-}
+  private static class FieldData extends KnnFieldVectorsWriter {

Review Comment:
   Nice suggestion! Addressed in b47ddc9e6834e2b0a838adf7fc1bed791b24ce2e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919288022


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+  throws IOException;
+
+  /** Flush all buffered data on disk * */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws 
IOException;

Review Comment:
   @jtibshirani  Thanks for the suggestion. I thought about how to organize it, 
but I could not find a good way to do it, so I left things as they are.
   
   In `IndexingChain#flush` we could have called `KnnFieldVectorsWriter#flush`, 
but the `flush` operation also requires calling `writer.finish()` and closing 
the writer, so it is better managed by `VectorValuesConsumer` than by 
individual `KnnFieldVectorsWriter` objects.
   
   >  this would help make Lucene93HnswVectorsWriter easier to read, because we 
could separate out the complex sorting logic into a class like 
SortingFieldWriter
   
   This is also challenging to implement, because whether a field writer needs 
to be a `SortingFieldWriter` only becomes known during flush (if `sortMap != 
null`), so it would require converting the usual field writer object to a 
`SortingFieldWriter` on flush, which doesn't look nice. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-12 Thread GitBox


luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r918848057


##
lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:
##
@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
 }
 
 @Override
-public void compress(byte[] bytes, int off, int len, DataOutput out) 
throws IOException {
+public void compress(ByteBuffersDataInput buffersInput, int off, int len, 
DataOutput out)

Review Comment:
   It is a nice suggestion; I will try to use the method 
`compress(CompositeByteBuf compositeByteBuf, DataOutput out)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-07-12 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565468#comment-17565468
 ] 

Michael Sokolov commented on LUCENE-10471:
--

We should not be imposing an arbitrary limit that prevents people with CNNs 
(image-processing models) from using this feature. It makes sense to me to 
increase the limit to the point where we would see actual bugs/failures, or 
where the large numbers might prevent us from making some future optimization, 
rather than trying to determine where the performance stops being acceptable -- 
that's a question for users to decide for themselves. Of course, we don't know 
where that place is that we might want to optimize in the future (Rob and I 
discussed an idea using all-integer math that would suffer from overflow), but 
still we should not just allow MAX_INT dimensions, I think. To me a limit like 
16K makes sense -- well beyond any stated use case, but not effectively infinite.

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple well-known models that produce vectors with > 1024 
> dimensions (e.g 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-12 Thread GitBox


luyuncheng commented on PR #987:
URL: https://github.com/apache/lucene/pull/987#issuecomment-1181632413

   > Would it be possible to remove all `CompressionMode#compress` variants 
that take a `byte[]` now that you introduced a new method that takes a 
`ByteBuffersDataInput`?
   > 
   > Also maybe we should keep old codecs unmodified and only make this change 
to `Lucene90Codec` where it makes most sense?
   
   Hi @jpountz Thanks for reviewing this code. 
   
   I prefer keeping old codecs unmodified, because `CompressionMode#compress` 
is a public abstract method; if we changed it to take a `ByteBuffersDataInput`, 
we would need to backport the change to many codecs, like 
[commits](https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene50/Lucene50StoredFieldsFormat.java).
   
   And if we only use the compress method with the ByteBuffersDataInput variant 
in LUCENE90, we cannot use the abstract method `Compressor.compress` when we 
want to use another compression mode.
   
   Would it be possible to add a new method in `Compressor`, like the following? 
It would keep the old codecs unmodified, and the method taking a 
`CompositeByteBuf` would only be called in `Lucene90Codec`.
   
   ```
   public abstract void compress(byte[] bytes, int off, int len, DataOutput 
out) throws IOException;
   
   public void compress(CompositeByteBuf compositeByteBuf, DataOutput out) 
throws IOException;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919332844


##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter {
 }
 
 @Override
-public void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws 
IOException {
+  KnnVectorsWriter writer = getInstance(fieldInfo);
+  return writer.addField(fieldInfo);
+}
+
+@Override
+public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
+  for (WriterAndSuffix was : formats.values()) {
+was.writer.flush(maxDoc, sortMap);
+  }
+}
+
+@Override
+public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
 throws IOException {
-  getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader);
+  getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader);

Review Comment:
   @jtibshirani  Thanks for the suggestion, but we can't at the same time make 
`KnnVectorsWriter#merge` final and also un-support `mergeOneField`, can we?
   
   Should we keep `mergeOneField` and make `KnnVectorsWriter#merge` final, as in 
your other suggestion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565885#comment-17565885
 ] 

Adrien Grand commented on LUCENE-10649:
---

Good catch [~vigyas], it looks related indeed. The bug seems to be that 
{{ReindexingMergePolicy}} doesn't override {{findFullFlushMerges}} to wrap 
input readers, so the merged segment doesn't get fields from the parallel 
reader. Would you like to open a PR?

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> Failing Build Link: 
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests 
> TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
> -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
> -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at 
> __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at 
> org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #39: Fix stack overflow when parsing lists

2022-07-12 Thread GitBox


mocobeta opened a new pull request, #39:
URL: https://github.com/apache/lucene-jira-archive/pull/39

   Close #38 
   
   This ad-hoc patch fixes the `'maximum recursion depth exceeded'` error and 
also makes the script a bit faster (8h -> 5h).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures

2022-07-12 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565882#comment-17565882
 ] 

Ignacio Vera commented on LUCENE-10628:
---

I have mainly worked with two types of trees in Lucene. 

* 
[KD-tree|https://github.com/apache/lucene/blob/35ca2d79f73c6dfaf5e648fe241f7e0b37084a90/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java#L79]:
  It is complex to build, so probably not suitable for building on the fly, 
but it is the best structure for an index.

* [Interval 
tree|https://github.com/apache/lucene/blob/2d6ad2fee6dfd96388594f4de9b37c037efe8017/lucene/core/src/java/org/apache/lucene/geo/ComponentTree.java#L28]
 (I think originally introduced by Robert Muir): Not as efficient as a 
KD-tree, but much cheaper to build and suitable for small data.

From a quick look I think you would be looking for an interval tree, but mind 
you that I have only worked with that tree for very low dimensions. I guess 
this kind of tree will quickly degenerate due to the [curse of 
dimensionality|https://en.wikipedia.org/wiki/Curse_of_dimensionality]. How 
many dimensions are you expecting to support?
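
To make the trade-off concrete, here is a toy centered interval tree in one 
dimension (illustrative only; Lucene's ComponentTree is organized 
differently): construction is a cheap O(n log n) sort-and-split, and a point 
query only descends one side per level.

{code:java}
import java.util.ArrayList;
import java.util.List;

final class IntervalTree {
  record Interval(double min, double max) {}

  private final double center;                               // split value
  private final List<Interval> crossing = new ArrayList<>(); // span the center
  private final IntervalTree left;                           // fully below center
  private final IntervalTree right;                          // fully above center

  IntervalTree(List<Interval> intervals) {
    // Simple split heuristic: the median of the min endpoints.
    List<Interval> sorted = new ArrayList<>(intervals);
    sorted.sort((x, y) -> Double.compare(x.min(), y.min()));
    this.center = sorted.get(sorted.size() / 2).min();
    List<Interval> lo = new ArrayList<>(), hi = new ArrayList<>();
    for (Interval iv : sorted) {
      if (iv.max() < center) lo.add(iv);
      else if (iv.min() > center) hi.add(iv);
      else crossing.add(iv); // overlaps the center, stays at this node
    }
    this.left = lo.isEmpty() ? null : new IntervalTree(lo);
    this.right = hi.isEmpty() ? null : new IntervalTree(hi);
  }

  /** Collects all intervals that contain the query point. */
  void stab(double p, List<Interval> hits) {
    for (Interval iv : crossing) {
      if (iv.min() <= p && p <= iv.max()) hits.add(iv);
    }
    if (p < center && left != null) left.stab(p, hits);
    if (p > center && right != null) right.stab(p, hits);
  }
}
{code}

The pruning argument behind {{stab}} is exactly what breaks down as the 
dimensionality grows, which is why the question above matters.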



> Enable MatchingFacetSetCounts to use space partitioning data structures
> ---
>
> Key: LUCENE-10628
> URL: https://issues.apache.org/jira/browse/LUCENE-10628
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Marc D'Mello
>Priority: Minor
>
> Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} 
> instances passed into it linearly. While this is fine in some cases, if we 
> have a large number of {{FacetSetMatcher}}s, this can be inefficient. We 
> should provide the option to users to enable the use of space partitioning 
> data structures (namely R trees and KD trees) so we can potentially scan over 
> these {{FacetSetMatcher}}s in sub-linear time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mocobeta commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181624324

   > Thanks -- I was beginning to wonder if it was normal how long it was 
taking ;)
   
   Of course it's not normal; I remember it took two or three hours to convert 
the whole Jira snapshot when I did the last test migration. Not fast, but an 
acceptable speed (taking into account that this is entirely written in Python).
   I've added some custom syntax parser components (e.g. 
https://github.com/apache/lucene-jira-archive/pull/19) to fix conversion errors 
since then; some of them likely cause a slowdown in parsing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-12 Thread GitBox


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919288022


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] 
vectorValue)
+  throws IOException;
+
+  /** Flush all buffered data on disk * */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws 
IOException;

Review Comment:
   @jtibshirani  Thanks for the suggestion. I thought about how to organize 
it, and I could not find a good way to do it, so I left things as they are.
   
   In `IndexingChain#flush` we could have called `KnnFieldVectorsWriter#flush`, 
but the `flush` operation also requires calling `writer.finish();` and closing 
the writer, so it is better managed by `VectorValuesConsumer` than by 
individual `KnnFieldVectorsWriter` objects.
   
   >  this would help make Lucene93HnswVectorsWriter easier to read, because we 
could separate out the complex sorting logic into a class like 
SortingFieldWriter
   
   This is also challenging to implement because whether a field writer is a 
`SortingFieldWriter` only becomes known during flush, so this would require 
converting a usual field writer object to a `SortingFieldWriter` on flush, 
which doesn't look nice. 
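   
   To illustrate that lifecycle argument, a hedged sketch of the sequence 
`VectorValuesConsumer` would own (the method name `flushVectors` and the exact 
shape are assumptions, not the patch itself):
   
   // flush is more than KnnVectorsWriter#flush: the consumer also has to
   // finish and close the writer, so the whole sequence lives in one place.
   void flushVectors(int maxDoc, Sorter.DocMap sortMap) throws IOException {
     writer.flush(maxDoc, sortMap); // write the buffered vectors/graphs
     writer.finish();               // seal the per-segment metadata
     writer.close();                // this writer is done for the segment
   }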



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565402#comment-17565402
 ] 

Adrien Grand commented on LUCENE-10603:
---

+1

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, 
> should we refactor the implementation of ords iteration to use docValueCount 
> instead of NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand closed pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand closed pull request #33: Polish wording of Legacy Jira details 
header, and each comment footer
URL: https://github.com/apache/lucene-jira-archive/pull/33


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand opened a new pull request, #40: #27: polish the legacy Jira text added to the issue a bit

2022-07-12 Thread GitBox


mikemccand opened a new pull request, #40:
URL: https://github.com/apache/lucene-jira-archive/pull/40

   I "rebooted" my PR by downloading the diff off the messed up #33 PR, futzing 
it locally, applying, resolving conflicts.  Messy messy.  I'll try to more 
carefully manage the git merging steps next time ...
   
   I re-tested that this version is able to export tricky issues LUCENE-550 and 
LUCENE-4341, still showing the stack overflow error until we push @mocobeta's 
nice fix in #39.
   
   Closes #33.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mikemccand commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181821682

   I'm closing this messed up PR -- I rebooted it into #40.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields

2022-07-12 Thread GitBox


jpountz commented on code in PR #1003:
URL: https://github.com/apache/lucene/pull/1003#discussion_r918758391


##
lucene/core/src/java/org/apache/lucene/codecs/compressing/Decompressor.java:
##
@@ -42,6 +44,13 @@ protected Decompressor() {}
   public abstract void decompress(
   DataInput in, int originalLength, int offset, int length, BytesRef 
bytes) throws IOException;
 
+  public InputStream decompress(DataInput in, int originalLength, int offset, 
int length)

Review Comment:
   Maybe it would help to make it a `DataInput` and use `ByteArrayDataInput`.
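   
   A hedged sketch of that suggestion, reusing the existing abstract 
`decompress` overload (the body is illustrative, not the PR's actual change):
   
   public DataInput decompress(DataInput in, int originalLength, int offset, int length)
       throws IOException {
     BytesRef bytes = new BytesRef();
     // Fill the BytesRef via the existing overload, then expose it as a
     // DataInput so callers can read stored fields lazily from it.
     decompress(in, originalLength, offset, length, bytes);
     return new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);
   }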



##
lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java:
##
@@ -98,6 +100,16 @@ public void doubleField(FieldInfo fieldInfo, double value) {
 
   @Override
   public Status needsField(FieldInfo fieldInfo) throws IOException {
+// return stop after collected all needed fields
+if (fieldsToAdd != null
+&& !fieldsToAdd.contains(fieldInfo.name)
+&& fieldsToAdd.size()
+== doc.getFields().stream()
+.map(IndexableField::name)
+.collect(Collectors.toSet())
+.size()) {
+  return Status.STOP;

Review Comment:
   This isn't correct; some fields could have multiple values?
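   
   A hedged illustration of the pitfall (plain Lucene API; field name and 
values invented):
   
   Document doc = new Document();
   doc.add(new StoredField("tag", "a"));
   doc.add(new StoredField("tag", "b")); // a second value of the same field
   // After visiting "tag"="a", the distinct-name count already equals
   // fieldsToAdd.size() for fieldsToAdd = {"tag"}, so Status.STOP would
   // silently drop "tag"="b".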



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread GitBox


jpountz commented on code in PR #966:
URL: https://github.com/apache/lucene/pull/966#discussion_r918804129


##
lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java:
##
@@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) {
   }
 
   final void writeBytes(int stream, byte[] b, int offset, int len) {
-// TODO: optimize
 final int end = offset + len;
-for (int i = offset; i < end; i++) writeByte(stream, b[i]);
+int streamAddress = streamAddressOffset + stream;
+int upto = termStreamAddressBuffer[streamAddress];
+byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT];
+assert slice != null;
+int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK;
+
+while (slice[sliceOffset] == 0 && offset < end) {
+  slice[sliceOffset++] = b[offset++];
+  (termStreamAddressBuffer[streamAddress])++;
+}

Review Comment:
   Maybe in the future we could optimize this case a bit too by using 
`Arrays#mismatch` with an array that is full of zeroes.
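   
   A hedged sketch of that idea (`ZEROES` is an assumed preallocated all-zero 
array at least one slice long; the slice-address bookkeeping is omitted):
   
   // Length of the leading zero run, i.e. the writable room left in this
   // slice; Arrays.mismatch returns -1 when the compared ranges are equal.
   int cmpLen = Math.min(slice.length - sliceOffset, end - offset);
   int mismatch = Arrays.mismatch(slice, sliceOffset, sliceOffset + cmpLen, ZEROES, 0, cmpLen);
   int zeroRun = mismatch == -1 ? cmpLen : mismatch;
   System.arraycopy(b, offset, slice, sliceOffset, zeroRun); // one bulk copy, no byte loop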



##
lucene/core/src/test/org/apache/lucene/index/TestTermsHashPerField.java:
##
@@ -298,4 +299,23 @@ class Posting {
   assertTrue("the last posting must be EOF on the reader", eof);
 }
   }
+
+  public void testWriteBytes() throws IOException {
+for (int i = 0; i < 100; i++) {
+  AtomicInteger newCalled = new AtomicInteger(0);
+  AtomicInteger addCalled = new AtomicInteger(0);
+  TermsHashPerField hash = createNewHash(newCalled, addCalled);
+  hash.start(null, true);
+  hash.add(newBytesRef("start"), 0); // tid = 0;
+  int size = TestUtil.nextInt(random(), 5, 10);
+  byte[] randomData = new byte[size];
+  random().nextBytes(randomData);
+  hash.writeBytes(0, randomData, 0, randomData.length);

Review Comment:
   Maybe change this to write small chunks at once to better exercise the case 
when we're starting a write in the middle or at the end of a slice?
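   
   Something along these lines, sketched against the test's existing `hash` 
and `randomData` (chunk sizes are arbitrary):
   
   // Write the same payload in random small chunks so that writes start in
   // the middle of a slice and cross slice boundaries.
   int off = 0;
   while (off < randomData.length) {
     int chunk = TestUtil.nextInt(random(), 1, randomData.length - off);
     hash.writeBytes(0, randomData, off, chunk);
     off += chunk;
   }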



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text

2022-07-12 Thread GitBox


mocobeta commented on issue #38:
URL: 
https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181605666

   Thank you for opening this.
   
   While the stack overflow is rare, this recursion in parsing also causes a 
significant slowdown in conversion.
   I'm sure the root cause of both the slowdown and the stack overflow is this 
line (a customized version of the Jira list syntax parser):
   
https://github.com/apache/lucene-jira-archive/blob/b4f125913eb77ed807d4f1a5836ac4d330f2352a/migration/src/markup/lists.py#L69
   
   I'm trying to find other ways to parse lists correctly that do not cause 
infinite recursion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-07-12 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10600:
--
Fix Version/s: 9.3

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565375#comment-17565375
 ] 

Adrien Grand commented on LUCENE-10480:
---

+1 to explore this in a separate issue.

bq. Do you think this slowdown to AndHighOrMedMed may be considered as blocker 
to 9.3 release? 

I wouldn't call it a blocker, but maybe we could buy ourselves some time by 
only using this new scorer on top-level disjunctions for now, so that we have 
more time to figure out whether we should stick to BMW or switch to BMM for 
inner disjunctions.
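
For context, a stripped-down sketch of why the two-clause case is so much 
cheaper (only the real DocIdSetIterator contract is assumed; the 
max-score/BMM bookkeeping is deliberately omitted):

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

final class TwoClauseDisjunctionIterator {
  private final DocIdSetIterator a, b;

  TwoClauseDisjunctionIterator(DocIdSetIterator a, DocIdSetIterator b) {
    this.a = a;
    this.b = b;
  }

  // With exactly two clauses the next match is just the min of two cursors:
  // no linked list of candidates and no priority queues to rebalance.
  int nextDoc() throws IOException {
    int doc = Math.min(a.docID(), b.docID());
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc; // both clauses exhausted
    }
    if (a.docID() == doc) a.nextDoc(); // advance only the clause(s) on doc
    if (b.docID() == doc) b.nextDoc();
    return Math.min(a.docID(), b.docID());
  }
}
{code}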

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, and another one for scorers that are ahead. All this 
> could be simplified in the two-clause case, which seems worth specializing 
> for, as it's very common that end users enter queries with only two terms.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on issue #36: Can we parallelize the converter script?

2022-07-12 Thread GitBox


mocobeta commented on issue #36:
URL: 
https://github.com/apache/lucene-jira-archive/issues/36#issuecomment-1181497062

   I found https://pypi.org/project/multiprocessing-logging/, but this works 
only on Linux.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer

2022-07-12 Thread GitBox


mocobeta commented on PR #33:
URL: 
https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181456586

   I'm also converting all the Jira issues myself; it looks like it takes 
several hours... (recent changes to fix conversion errors could affect the 
conversion speed, I think). It shouldn't be so slow, so I raised #36. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #36: Can we parallelize the converter script?

2022-07-12 Thread GitBox


mocobeta opened a new issue, #36:
URL: https://github.com/apache/lucene-jira-archive/issues/36

   `jira2markdown_imprt.py` is single-threaded, and it takes several hours to 
convert all Jira issues.
   I think it'd be easy to parallelize this with the 
[multiprocessing](https://docs.python.org/3/library/multiprocessing.html) 
module (it does not call any HTTP APIs), but I remember that, a few years ago, 
`logging` was not thread-safe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on a diff in pull request #1015: [LUCENE-10629]: Add fast match query support to FacetSets

2022-07-12 Thread GitBox


stefanvodita commented on code in PR #1015:
URL: https://github.com/apache/lucene/pull/1015#discussion_r918597529


##
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##
@@ -52,8 +52,10 @@ public MatchingFacetSetsCounts(
   String field,
   FacetsCollector hits,
   FacetSetDecoder facetSetDecoder,
+  Query fastMatchQuery,

Review Comment:
   Thanks! I'm happy with the PR as it is now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org