[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-28 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r908179203


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +419,45 @@ public BytesRef lookupOrd(long ord) throws IOException {
 public long getValueCount() {
   return in.getValueCount();
 }
+
+private void initCount() {
+  assert docID >= 0;
+  ordUpto = ords.offsets[docID] - 1;
+  count = (int) ords.docValueCounts.get(docID);
+  limit = ordUpto + count;
+}
   }
 
   static final class DocOrds {
 final long[] offsets;
 final PackedLongValues ords;
+final GrowableWriter docValueCounts;
+
+public static final int START_BITS_PER_VALUE = 2;
 
 DocOrds(
 int maxDoc,
 Sorter.DocMap sortMap,
 SortedSetDocValues oldValues,
-float acceptableOverheadRatio)
+float acceptableOverheadRatio,
+int bitsPerValue)
 throws IOException {
   offsets = new long[maxDoc];
   PackedLongValues.Builder builder = 
PackedLongValues.packedBuilder(acceptableOverheadRatio);
-  long ordOffset = 1; // 0 marks docs with no values
+  docValueCounts = new GrowableWriter(bitsPerValue, maxDoc, 
acceptableOverheadRatio);
+  long ordOffset = 1;

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907352353


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +419,45 @@ public BytesRef lookupOrd(long ord) throws IOException {
 public long getValueCount() {
   return in.getValueCount();
 }
+
+private void initCount() {
+  assert docID >= 0;
+  ordUpto = ords.offsets[docID] - 1;
+  count = (int) ords.docValueCounts.get(docID);
+  limit = ordUpto + count;
+}
   }
 
   static final class DocOrds {
 final long[] offsets;
 final PackedLongValues ords;
+final GrowableWriter docValueCounts;
+
+public static final int START_BITS_PER_VALUE = 2;
 
 DocOrds(
 int maxDoc,
 Sorter.DocMap sortMap,
 SortedSetDocValues oldValues,
-float acceptableOverheadRatio)
+float acceptableOverheadRatio,
+int bitsPerValue)
 throws IOException {
   offsets = new long[maxDoc];
   PackedLongValues.Builder builder = 
PackedLongValues.packedBuilder(acceptableOverheadRatio);
-  long ordOffset = 1; // 0 marks docs with no values
+  docValueCounts = new GrowableWriter(bitsPerValue, maxDoc, 
acceptableOverheadRatio);
+  long ordOffset = 1;

Review Comment:
   Let's start at 0 to not have to subtract 0 all the time in `initCount()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907164385


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -382,23 +386,20 @@ public int advance(int target) {
 public boolean advanceExact(int target) throws IOException {
   // needed in IndexSorter#StringSorter
   docID = target;
+  initCount();
   ordUpto = ords.offsets[docID] - 1;
   return ords.offsets[docID] > 0;
 }
 
 @Override
 public long nextOrd() {
-  long ord = ords.ords.get(ordUpto++);
-  if (ord == 0) {
-return NO_MORE_ORDS;
-  } else {
-return ord - 1;
-  }
+  return ords.ords.get(ordUpto++);

Review Comment:
   The problem is that custom DocValuesFormats would no longer work if they 
iterate over values using NO_MORE_ORDS. Maybe it's ok because users who write 
custom codecs are expert users, but I'd rather discuss it on a separate issue 
that do it silently as part of this bug fix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-27 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r907109652


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -382,23 +386,20 @@ public int advance(int target) {
 public boolean advanceExact(int target) throws IOException {
   // needed in IndexSorter#StringSorter
   docID = target;
+  initCount();
   ordUpto = ords.offsets[docID] - 1;
   return ords.offsets[docID] > 0;
 }
 
 @Override
 public long nextOrd() {
-  long ord = ords.ords.get(ordUpto++);
-  if (ord == 0) {
-return NO_MORE_ORDS;
-  } else {
-return ord - 1;
-  }
+  return ords.ords.get(ordUpto++);

Review Comment:
   We should keep returning NO_MORE_ORDS when ords are exhausted.



##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +416,43 @@ public BytesRef lookupOrd(long ord) throws IOException {
 public long getValueCount() {
   return in.getValueCount();
 }
+
+private void initCount() {
+  assert docID >= 0;
+  count = (int) ords.growableWriter.get(docID);
+}
   }
 
   static final class DocOrds {
 final long[] offsets;
 final PackedLongValues ords;
+final GrowableWriter growableWriter;

Review Comment:
   Let's call it `docValueCounts` or something like that that better reflects 
what it stores?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-24 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r905889905


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -114,6 +116,7 @@ private void finishCurrentDoc() {
   }
   lastValue = termID;
 }
+maxBitsRequired |= count;

Review Comment:
   If you are or-ing values, then it's not a number of bits per value. Maybe 
compute the max count instead, and later compute the number of bits required 
using `PackedInts#bitsRequired`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java:
##
@@ -805,11 +805,9 @@ public int nextDoc() throws IOException {
 int doc = values.nextDoc();
 if (doc != NO_MORE_DOCS) {
   docValueCount = 0;
-  for (long ord = values.nextOrd();
-  ord != SortedSetDocValues.NO_MORE_ORDS;
-  ord = values.nextOrd()) {
+  for (int j = 0; j < values.docValueCount(); j++) {
 ords = ArrayUtil.grow(ords, docValueCount + 1);

Review Comment:
   Let's move this `grow` call outside of the loop since the number of values 
in known up-front?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues

2022-06-24 Thread GitBox


jpountz commented on code in PR #967:
URL: https://github.com/apache/lucene/pull/967#discussion_r905820617


##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -439,29 +433,42 @@ private void set() {
   static final class DocOrds {
 final long[] offsets;
 final PackedLongValues ords;
+final GrowableWriter growableWriter;
+
+public static final int START_BITS_PER_VALUE = 2;
 

Review Comment:
   I suspect that most multi-valued fields still have a small number of bits 
per value e.g. 2 or 3, so START_BITS_PER_VALUE=2 makes sense to me.



##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -350,6 +354,8 @@ static class SortingSortedSetDocValues extends 
SortedSetDocValues {
 private final DocOrds ords;
 private int docID = -1;
 private long ordUpto;
+private boolean set;

Review Comment:
   maybe set the `count` on every call to `nextDoc` so that you don't have to 
maintain this `set` flag to know whether the count is set or not?



##
lucene/core/src/java/org/apache/lucene/index/SortedSetDocValuesWriter.java:
##
@@ -415,34 +420,55 @@ public BytesRef lookupOrd(long ord) throws IOException {
 public long getValueCount() {
   return in.getValueCount();
 }
+
+private void set() {
+  if (set == false) {
+assert docID >= 0;
+count = (int) ords.growableWriter.get(docID);
+set = true;
+  }
+}
   }
 
   static final class DocOrds {
 final long[] offsets;
 final PackedLongValues ords;
+final GrowableWriter growableWriter;
+
+public static final int START_BITS_PER_VALUE = 2;
 
 DocOrds(
 int maxDoc,
 Sorter.DocMap sortMap,
 SortedSetDocValues oldValues,
 float acceptableOverheadRatio)
 throws IOException {
+  this(maxDoc, sortMap, oldValues, acceptableOverheadRatio, 
START_BITS_PER_VALUE);
+}
+
+DocOrds(
+int maxDoc,
+Sorter.DocMap sortMap,
+SortedSetDocValues oldValues,
+float acceptableOverheadRatio,
+int bitsPerValue)
+throws IOException {

Review Comment:
   Could we keep a single constructor?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org