Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-22 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1574272996


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Done in https://github.com/apache/lucene/pull/13279.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-02 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2033469573

   Glad to know that. Thanks @mikemccand.





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-02 Thread via GitHub


mikemccand commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2032280982

   Oooh this change gave a nice pop (~5.4%, ~915 -> 964 K lookups/sec) to the 
primary key lookup nightly benchy: 
https://home.apache.org/~mikemccand/lucenebench/PKLookup.html
   
   I'll add an annotation, exciting!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-01 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1546339137


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Yes, I will do it.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-01 Thread via GitHub


mikemccand merged PR #11888:
URL: https://github.com/apache/lucene/pull/11888





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-01 Thread via GitHub


mikemccand commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2029720633

   Actually I can just re-merge your prior `CHANGES.txt` entry from 
[here](https://github.com/apache/lucene/pull/11888/commits/a695c07da8ccdb348c87f98e6b4be6d778d919c3),
 so no need to push another rev here.  Thanks @vsop-479 !





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-01 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1546290149


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   > Should we do these same changes to `scanToTermLeaf` ( maybe in a new PR)?
   
   +1, separate PR?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-04-01 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1546289740


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   > By the way, should I add a CHANGES entry for this change?
   
   Oh yes please!






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-29 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1543957681


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Done.
   
   > we set ste.termExists above so we could just remove this comment and the assert instead?
   
   > entries check -> entries to check?
   
   Should we make the same changes to `scanToTermLeaf` (maybe in a new PR)?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1543957681


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Done.
   
   > we set ste.termExists above so we could just remove this comment and the assert instead?
   
   > entries check -> entries to check?
   
   Should we make the same change to `scanToTermLeaf` (maybe in another PR)?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1543957681


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Done.
   
   > we set ste.termExists above so we could just remove this comment and the assert instead?
   
   > entries check -> entries to check?
   
   Should we make the same change to `scanToTermLeaf` (maybe in another PR)?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1543070445


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   Thanks @mikemccand, I will fix this.
   By the way, should I add a CHANGES entry for this change?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542769731


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,97 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.

Review Comment:
   `entries check` -> `entries to check`?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2024550150

   Thanks for your comments @mikemccand. I have addressed them and removed the 
stale CHANGES entry for this change.
   Please take a look when you get a chance.





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542387588


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) 
throws IOException {
+// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + 
fp + " prefix=" +
+// prefix + "
+// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + 
brToString(target) + " term=" +
+// brToString(term));
+
+assert nextEnt != -1;
+
+ste.termExists = true;
+subCode = 0;
+
+if (nextEnt == entCount) {
+  if (exactOnly) {
+fillTerm();
+  }
+  return SeekStatus.END;
+}
+
+assert prefixMatches(target);
+
+suffix = suffixLengthsReader.readVInt();
+// TODO early terminate when target length unequals suffix + prefix.
+// But we need to keep the same status with scanToTermLeaf.
+int start = nextEnt;
+int end = entCount - 1;
+// Binary search the entries (terms) in this leaf block:
+int cmp = 0;
+while (start <= end) {
+  int mid = (start + end) / 2;
+  nextEnt = mid + 1;
+  startBytePos = mid * suffix;
+
+  // Binary search bytes in the suffix, comparing to the target
+  cmp =
+  Arrays.compareUnsigned(
+  suffixBytes,
+  startBytePos,
+  startBytePos + suffix,
+  target.bytes,
+  target.offset + prefix,
+  target.offset + target.length);
+  if (cmp < 0) {
+start = mid + 1;
+  } else if (cmp > 0) {
+end = mid - 1;
+  } else {
+// Exact match!
+suffixesReader.setPosition(startBytePos + suffix);
+// This cannot be a sub-block because we
+// would have followed the index to this
+// sub-block from the start:
+assert ste.termExists;
+fillTerm();
+// if (DEBUG) System.out.println("found!");
+return SeekStatus.FOUND;
+  }
+}
+
+// It is possible (and OK) that terms index pointed us
+// at this block, but, we searched the entire block and
+// did not find the term to position to.  This happens
+// when the target is after the last term in the block
+// (but, before the next term in the index).  EG
+// target could be foozzz, and terms index pointed us
+// to the foo* block, but the last term in this block
+// was fooz (and, eg, first term in the next block will
+// bee fop).
+// if (DEBUG) System.out.println("  block end");
+SeekStatus seekStatus = end < entCount - 1 ? SeekStatus.NOT_FOUND : 
SeekStatus.END;
+if (seekStatus == SeekStatus.NOT_FOUND) {

Review Comment:
   Thanks @mikemccand. This makes the code clearer.

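For readers following the thread, the core idea of the quoted `binarySearchTermLeaf` hunk can be sketched outside Lucene. When every suffix in a leaf block has the same length, entry `i` starts at byte offset `i * suffixLength`, so the block can be binary searched without walking the per-entry lengths. The sketch below is a simplified assumption-laden model, not the PR's code: the class name `FixedLengthSuffixSearch`, the `search` method, and its return convention (index, or `-(insertionPoint + 1)` as in `Arrays.binarySearch`) are made up for illustration, and it ignores the frame state (`nextEnt`, `suffixesReader`, `fillTerm()`) that the real method maintains.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative model only; names and return convention are assumptions, not Lucene API.
final class FixedLengthSuffixSearch {

  /**
   * Binary searches entCount sorted entries of suffixLen bytes each, stored back to back
   * in suffixBytes. Returns the entry index on a match, otherwise -(insertionPoint + 1).
   */
  static int search(byte[] suffixBytes, int suffixLen, int entCount, byte[] target) {
    int start = 0;
    int end = entCount - 1;
    while (start <= end) {
      int mid = (start + end) >>> 1;
      // Fixed-length entries: the offset is computed directly from the index.
      int pos = mid * suffixLen;
      int cmp =
          Arrays.compareUnsigned(suffixBytes, pos, pos + suffixLen, target, 0, target.length);
      if (cmp < 0) {
        start = mid + 1;
      } else if (cmp > 0) {
        end = mid - 1;
      } else {
        return mid;
      }
    }
    return -(start + 1);
  }

  public static void main(String[] args) {
    // Three made-up 3-byte suffixes: "aaa", "bcd", "xyz".
    byte[] block = "aaabcdxyz".getBytes(StandardCharsets.US_ASCII);
    System.out.println(search(block, 3, 3, "bcd".getBytes(StandardCharsets.US_ASCII)));
    System.out.println(search(block, 3, 3, "bbb".getBytes(StandardCharsets.US_ASCII)));
  }
}
```

The comparison uses `Arrays.compareUnsigned` (Java 9+), the same call that appears in the quoted hunk, so that bytes above 0x7F sort after ASCII bytes.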





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542363416


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) 
throws IOException {
+// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + 
fp + " prefix=" +
+// prefix + "
+// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + 
brToString(target) + " term=" +
+// brToString(term));
+
+assert nextEnt != -1;
+
+ste.termExists = true;
+subCode = 0;
+
+if (nextEnt == entCount) {
+  if (exactOnly) {
+fillTerm();
+  }
+  return SeekStatus.END;
+}
+
+assert prefixMatches(target);
+
+suffix = suffixLengthsReader.readVInt();
+// TODO early terminate when target length unequals suffix + prefix.
+// But we need to keep the same status with scanToTermLeaf.
+int start = nextEnt;
+int end = entCount - 1;
+// Binary search the entries (terms) in this leaf block:
+int cmp = 0;
+while (start <= end) {
+  int mid = (start + end) / 2;
+  nextEnt = mid + 1;
+  startBytePos = mid * suffix;
+
+  // Binary search bytes in the suffix, comparing to the target
+  cmp =
+  Arrays.compareUnsigned(
+  suffixBytes,
+  startBytePos,
+  startBytePos + suffix,
+  target.bytes,
+  target.offset + prefix,
+  target.offset + target.length);
+  if (cmp < 0) {
+start = mid + 1;
+  } else if (cmp > 0) {
+end = mid - 1;
+  } else {
+// Exact match!
+suffixesReader.setPosition(startBytePos + suffix);
+// This cannot be a sub-block because we
+// would have followed the index to this
+// sub-block from the start:
+assert ste.termExists;
+fillTerm();
+// if (DEBUG) System.out.println("found!");
+return SeekStatus.FOUND;
+  }
+}
+
+// It is possible (and OK) that terms index pointed us
+// at this block, but, we searched the entire block and
+// did not find the term to position to.  This happens
+// when the target is after the last term in the block
+// (but, before the next term in the index).  EG
+// target could be foozzz, and terms index pointed us
+// to the foo* block, but the last term in this block
+// was fooz (and, eg, first term in the next block will
+// bee fop).
+// if (DEBUG) System.out.println("  block end");
+SeekStatus seekStatus = end < entCount - 1 ? SeekStatus.NOT_FOUND : 
SeekStatus.END;
+if (seekStatus == SeekStatus.NOT_FOUND) {

Review Comment:
   Done.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-28 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542357624


##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List 
impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+// 10004a matched block's entries: [11, 13, ..., 100049].
+// if target greater than the last entry of the matched block,
+// termsEnum.term should be the last entry.
+assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   > Is there a seekCeil based test case we can make?
   
   Yes, `seekCeil` also triggers an `AssertionError` without the fix in 
7084596c1c3a62dec2614aaeb37d0954f5fbd4e2.
   So I used it to replace `seekExact`. Thanks @mikemccand.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542233210


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) 
throws IOException {
+// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + 
fp + " prefix=" +
+// prefix + "
+// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + 
brToString(target) + " term=" +
+// brToString(term));
+
+assert nextEnt != -1;
+
+ste.termExists = true;
+subCode = 0;
+
+if (nextEnt == entCount) {
+  if (exactOnly) {
+fillTerm();
+  }
+  return SeekStatus.END;
+}
+
+assert prefixMatches(target);
+
+suffix = suffixLengthsReader.readVInt();
+// TODO early terminate when target length unequals suffix + prefix.
+// But we need to keep the same status with scanToTermLeaf.
+int start = nextEnt;
+int end = entCount - 1;
+// Binary search the entries (terms) in this leaf block:
+int cmp = 0;
+while (start <= end) {
+  int mid = (start + end) / 2;

Review Comment:
   Good catch. I will do it.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1542231368


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) 
throws IOException {
+// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + 
fp + " prefix=" +
+// prefix + "
+// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + 
brToString(target) + " term=" +
+// brToString(term));
+
+assert nextEnt != -1;
+
+ste.termExists = true;
+subCode = 0;
+
+if (nextEnt == entCount) {
+  if (exactOnly) {
+fillTerm();
+  }
+  return SeekStatus.END;
+}
+
+assert prefixMatches(target);
+
+suffix = suffixLengthsReader.readVInt();
+// TODO early terminate when target length unequals suffix + prefix.
+// But we need to keep the same status with scanToTermLeaf.
+int start = nextEnt;
+int end = entCount - 1;
+// Binary search the entries (terms) in this leaf block:
+int cmp = 0;
+while (start <= end) {
+  int mid = (start + end) / 2;
+  nextEnt = mid + 1;
+  startBytePos = mid * suffix;
+
+  // Binary search bytes in the suffix, comparing to the target
+  cmp =
+  Arrays.compareUnsigned(
+  suffixBytes,
+  startBytePos,
+  startBytePos + suffix,
+  target.bytes,
+  target.offset + prefix,
+  target.offset + target.length);
+  if (cmp < 0) {
+start = mid + 1;
+  } else if (cmp > 0) {
+end = mid - 1;
+  } else {
+// Exact match!
+suffixesReader.setPosition(startBytePos + suffix);
+// This cannot be a sub-block because we
+// would have followed the index to this
+// sub-block from the start:
+assert ste.termExists;

Review Comment:
   I will remove it.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1541971598


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) 
throws IOException {
+// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + 
fp + " prefix=" +
+// prefix + "
+// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + 
brToString(target) + " term=" +
+// brToString(term));
+
+assert nextEnt != -1;
+
+ste.termExists = true;
+subCode = 0;
+
+if (nextEnt == entCount) {
+  if (exactOnly) {
+fillTerm();
+  }
+  return SeekStatus.END;
+}
+
+assert prefixMatches(target);
+
+suffix = suffixLengthsReader.readVInt();
+// TODO early terminate when target length unequals suffix + prefix.
+// But we need to keep the same status with scanToTermLeaf.
+int start = nextEnt;
+int end = entCount - 1;
+// Binary search the entries (terms) in this leaf block:
+int cmp = 0;
+while (start <= end) {
+  int mid = (start + end) / 2;

Review Comment:
   It surely won't matter for this particular binary search but can we replace 
the division by 2 with logical right shift `>>> 1` instead, to avoid even the 
appearance of the [classic binary search overflow 
bug](https://thebittheories.com/the-curious-case-of-binary-search-the-famous-bug-that-remained-undetected-for-20-years-973e89fc212)?
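The overflow being referred to can be reproduced in isolation. The snippet below is a minimal sketch, not code from the PR; the class and method names are hypothetical. It compares `(start + end) / 2` with `(start + end) >>> 1` for two non-negative `int` bounds whose sum exceeds `Integer.MAX_VALUE`.

```java
// Illustrative sketch only; MidpointDemo and its methods are made-up names.
final class MidpointDemo {

  /** Signed division: once start + end wraps negative, the "midpoint" is negative too. */
  static int naiveMid(int start, int end) {
    return (start + end) / 2;
  }

  /**
   * Unsigned right shift: the wrapped sum is reinterpreted as an unsigned 32-bit value,
   * so the result is correct for any non-negative start and end.
   */
  static int shiftMid(int start, int end) {
    return (start + end) >>> 1;
  }

  public static void main(String[] args) {
    int start = 1_500_000_000;
    int end = 2_000_000_000;
    System.out.println(naiveMid(start, end)); // negative, because the sum overflowed
    System.out.println(shiftMid(start, end)); // 1750000000
  }
}
```

As the comment above says, a leaf block holds far too few entries for this to bite here; the shift simply removes any doubt when reading the code.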



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -642,6 +651,99 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx
     return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) throws IOException {
+    // if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + fp + " prefix=" +
+    // prefix + "
+    // nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
+    // brToString(term));
+
+    assert nextEnt != -1;
+
+    ste.termExists = true;
+    subCode = 0;
+
+    if (nextEnt == entCount) {
+      if (exactOnly) {
+        fillTerm();
+      }
+      return SeekStatus.END;
+    }
+
+    assert prefixMatches(target);
+
+    suffix = suffixLengthsReader.readVInt();
+    // TODO early terminate when target length unequals suffix + prefix.
+    // But we need to keep the same status with scanToTermLeaf.
+    int start = nextEnt;
+    int end = entCount - 1;
+    // Binary search the entries (terms) in this leaf block:
+    int cmp = 0;
+    while (start <= end) {
+      int mid = (start + end) / 2;
+      nextEnt = mid + 1;
+      startBytePos = mid * suffix;
+
+      // Binary search bytes in the suffix, comparing to the target
+      cmp =
+          Arrays.compareUnsigned(
+              suffixBytes,
+              startBytePos,
+              startBytePos + suffix,
+              target.bytes,
+              target.offset + prefix,
+              target.offset + target.length);
+      if (cmp < 0) {
+        start = mid + 1;
+      } else if (cmp > 0) {
+        end = mid - 1;
+      } else {
+        // Exact match!
+        suffixesReader.setPosition(startBytePos + suffix);
+        // This cannot be a sub-block because we
+        // would have followed the index to this
+        // sub-block from the start:
+        assert ste.termExists;
+        fillTerm();
+        // if (DEBUG) System.out.println("found!");
+        return SeekStatus.FOUND;
+      }
+    }
+
+    // It is possible (and OK) that terms index pointed us
+    // at this block, but, we searched the entire block and
+    // did not find the term to position to.  This happens
+    // when the target is after the last term in the block
+    // (but, before the next term in the index).  EG
+    // target could be foozzz, and terms index pointed us
+    // to the foo* block, but the last term in this block
+    // was fooz (and, eg, first term in the next block will
+    // bee fop).
+    // if (DEBUG) System.out.println("  block end");
+    SeekStatus seekStatus = end < entCount - 1 ? SeekStatus.NOT_FOUND : SeekStatus.END;
+    if (seekStatus == SeekStatus.NOT_FOUND) {
+      // If binary search ended at the less term, and greater term exists.
+      // We need to advance to the greater term.
+      if (cmp < 0) 

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1541892891


##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+    // 10004a matched block's entries: [11, 13, ..., 100049].
+    // if target greater than the last entry of the matched block,
+    // termsEnum.term should be the last entry.
+    assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+    assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   Well, I think we need to find a way to test this bug w/o abusing the API.  
Our tests should not violate our APIs ...
   
   Is there a `seekCeil` based test case we can make?
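
One possible shape for such a check, sketched against a toy in-memory enum rather than the real `TermsEnum`. The `ToyTermsEnum` class and the term values are invented for illustration. The idea is that after `seekCeil` returns `NOT_FOUND` the enum is positioned on the next greater term, so asserting on `term()` stays within the API contract, whereas after a failed exact seek it would not.

```java
import java.util.Arrays;

class SeekCeilSketch {
  enum SeekStatus { FOUND, NOT_FOUND, END }

  // Toy stand-in for a terms enum backed by a sorted array of terms.
  static final class ToyTermsEnum {
    private final String[] sortedTerms;
    private int pos = -1; // -1 means unpositioned

    ToyTermsEnum(String... sortedTerms) {
      this.sortedTerms = sortedTerms;
    }

    // Positions on the smallest term >= target, if there is one.
    SeekStatus seekCeil(String target) {
      int idx = Arrays.binarySearch(sortedTerms, target);
      if (idx >= 0) {
        pos = idx;
        return SeekStatus.FOUND;
      }
      int insertionPoint = -idx - 1;
      if (insertionPoint == sortedTerms.length) {
        pos = -1; // nothing at or after the target
        return SeekStatus.END;
      }
      pos = insertionPoint;
      return SeekStatus.NOT_FOUND;
    }

    // Only legal while positioned.
    String term() {
      if (pos < 0) {
        throw new IllegalStateException("enum is unpositioned");
      }
      return sortedTerms[pos];
    }
  }

  public static void main(String[] args) {
    ToyTermsEnum te = new ToyTermsEnum("100045", "100047", "100049", "100051");
    // Target sorts after "100049" but before "100051".
    System.out.println(te.seekCeil("10004a") + " -> " + te.term());
  }
}
```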






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2022219035

   @mikemccand Thanks for your review.
   I measured performance on `wikimediumall`:
   
   # iter1
   
   | Task | QPS baseline | StdDev | QPS my_modified_version | StdDev | Pct diff | p-value |
   | --- | --- | --- | --- | --- | --- | --- |
   | BrowseMonthTaxoFacets | 11.02 | (26.9%) | 10.62 | (28.4%) | -3.7% (-46% - 70%) | 0.727 |
   | BrowseRandomLabelSSDVFacets | 6.22 | (9.1%) | 6.04 | (6.5%) | -2.9% (-16% - 13%) | 0.335 |
   | HighTermTitleSort | 184.11 | (4.0%) | 180.43 | (4.0%) | -2.0% (-9% - 6%) | 0.187 |
   | TermDTSort | 214.19 | (4.6%) | 210.41 | (4.2%) | -1.8% (-10% - 7%) | 0.290 |
   | HighTermMonthSort | 4049.75 | (3.8%) | 4009.22 | (6.2%) | -1.0% (-10% - 9%) | 0.606 |
   | OrHighMedDayTaxoFacets | 6.41 | (6.7%) | 6.35 | (7.5%) | -0.9% (-14% - 14%) | 0.737 |
   | Prefix3 | 562.57 | (1.3%) | 558.72 | (1.7%) | -0.7% (-3% - 2%) | 0.228 |
   | AndHighHighDayTaxoFacets | 23.16 | (1.5%) | 23.02 | (1.6%) | -0.6% (-3% - 2%) | 0.288 |
   | HighPhrase | 49.14 | (3.9%) | 48.88 | (3.1%) | -0.5% (-7% - 6%) | 0.698 |
   | MedTermDayTaxoFacets | 17.63 | (3.6%) | 17.55 | (2.9%) | -0.4% (-6% - 6%) | 0.718 |
   | HighSpanNear | 15.30 | (1.8%) | 15.24 | (1.5%) | -0.4% (-3% - 3%) | 0.544 |
   | HighTermTitleBDVSort | 10.23 | (2.4%) | 10.19 | (3.0%) | -0.4% (-5% - 5%) | 0.717 |
   | BrowseDayOfYearSSDVFacets | 6.97 | (6.3%) | 6.95 | (6.0%) | -0.3% (-11% - 12%) | 0.890 |
   | Respell | 73.65 | (1.5%) | 73.44 | (2.4%) | -0.3% (-4% - 3%) | 0.717 |
   | HighTermDayOfYearSort | 525.93 | (2.5%) | 524.54 | (3.0%) | -0.3% (-5% - 5%) | 0.799 |
   | MedSpanNear | 75.25 | (2.6%) | 75.16 | (1.5%) | -0.1% (-4% - 3%) | 0.882 |
   | MedPhrase | 62.74 | (4.7%) | 62.73 | (2.6%) | -0.0% (-6% - 7%) | 0.987 |
   | LowSpanNear | 10.39 | (2.1%) | 10.39 | (1.4%) | 0.0% (-3% - 3%) | 0.998 |
   | Fuzzy1 | 99.01 | (1.7%) | 99.07 | (1.6%) | 0.1% (-3% - 3%) | 0.923 |
   | OrNotHighHigh | 576.76 | (4.0%) | 578.77 | (4.5%) | 0.3% (-7% - 9%) | 0.827 |
   | AndHighMedDayTaxoFacets | 80.00 | (1.3%) | 80.29 | (1.6%) | 0.4% (-2% - 3%) | 0.511 |
   | LowPhrase | 149.97 | (2.9%) | 150.57 | (2.1%) | 0.4% (-4% - 5%) | 0.675 |
   | OrHighLow | 675.00 | (2.6%) | 678.06 | (3.1%) | 0.5% (-5% - 6%) | 0.674 |
   | HighIntervalsOrdered | 2.81 | (14.0%) | 2.83 | (11.6%) | 0.5% (-22% - 30%) | 0.921 |
   | AndHighLow | 1027.38 | (4.2%) | 1032.64 | (3.9%) | 0.5% (-7% - 8%) | 0.738 |
   | Wildcard | 100.84 | (2.4%) | 101.44 | (2.8%) | 0.6% (-4% - 5%) | 0.547 |
   | Fuzzy2 | 92.33 | (1.5%) | 92.98 | (1.4%) | 0.7% (-2% - 3%) | 0.206 |
   | MedIntervalsOrdered | 13.07 | (10.5%) | 13.18 | (9.4%) | 0.8% (-17% - 23%) | 0.824 |
   | BrowseMonthSSDVFacets | 6.91 | (8.0%) | 6.97 | (7.1%) | 0.9% (-13% - 17%) | 0.748 |
   | OrNotHighMed | 341.39 | (3.4%) | 345.26 | (3.2%) | 1.1% (-5% - 7%) | 0.362 |
   | AndHighMed | 155.80 | (3.0%) | 157.63 | (3.2%) | 1.2% (-4% - 7%) | 0.316 |
   | BrowseDayOfYearTaxoFacets | 7.70 | (4.0%) | 7.79 | (4.4%) | 1.2% (-6% - 9%) | 0.450 |
   | OrHighHigh | 42.26 | (3.9%) | 42.77 | (3.6%) | 1.2% (-6% - 9%) | 0.396 |
   | BrowseRandomLabelTaxoFacets | 7.17 | (5.0%) | 7.26 | (4.5%) | 1.2% (-7% - 11%) | 0.486 |
   | OrHighNotHigh | 464.54 | (4.5%) | 470.46 | (5.2%) | 1.3% (-8% - 11%) | 0.490 |
   | LowTerm | 669.77 | (3.4%) | 678.39 | (4.3%) | 1.3% (-6% - 9%) | 0.383 |
   | OrHighMed | 118.91 | (3.2%) | 120.47 | (3.1%) | 1.3% (-4% - 7%) | 0.270 |
   | LowIntervalsOrdered | 63.73 | (8.3%) | 64.58 | (7.7%) | 1.3% (-13% - 18%) | 0.657 |
   | BrowseDateTaxoFacets | 7.63 | (3.8%) | 7.74 | (3.8%) | 1.4% (-5% - 9%) | 0.324 |
   | AndHighHigh | 30.16 | (5.2%) | 30.61 | (2.2%) | 1.5% (-5% - 9%) | 0.323 |
   | OrNotHighLow | 1186.21 | (4.7%) | 1203.99 | | | |

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-27 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1540528182


##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+    // 10004a matched block's entries: [11, 13, ..., 100049].
+    // if target greater than the last entry of the matched block,
+    // termsEnum.term should be the last entry.
+    assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+    assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   > why are we testing that here :)
   
   There was a bug (fixed by 7084596c1c3a62dec2614aaeb37d0954f5fbd4e2) in the 
previous implementation, so I added this test to guard against it.
   
   Should I remove it?






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2021791458

   > Was this on wikimediumall?
   
   No, this was on `wikimedium10k`. 
   I will measure the performance again on `wikimediumall`.





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1539192140


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -523,7 +526,9 @@ public void scanToSubBlock(long subFP) {
 
   // NOTE: sets startBytePos/suffix as a side effect
   public SeekStatus scanToTerm(BytesRef target, boolean exactOnly) throws IOException {
-    return isLeafBlock ? scanToTermLeaf(target, exactOnly) : scanToTermNonLeaf(target, exactOnly);
+    return isLeafBlock

Review Comment:
   I know this was a pre-existing ternary :)   But now we are embedding another 
confusing ternary inside the first one -- could we instead spell all of this 
out as verbose `if`?
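
A minimal sketch of what spelling it out could look like. The `allSuffixesSameLength` flag and the `Path` enum are hypothetical stand-ins, since the quoted diff is cut off after `return isLeafBlock`; the real method would call the corresponding scan or binary-search routine instead of returning an enum value.

```java
class ScanDispatchSketch {
  enum Path { BINARY_SEARCH_LEAF, SCAN_LEAF, SCAN_NON_LEAF }

  // Nested-ternary form, similar in shape to what the comment above objects to.
  static Path chooseTernary(boolean isLeafBlock, boolean allSuffixesSameLength) {
    return isLeafBlock
        ? (allSuffixesSameLength ? Path.BINARY_SEARCH_LEAF : Path.SCAN_LEAF)
        : Path.SCAN_NON_LEAF;
  }

  // The same decision written as plain if/else statements.
  static Path chooseVerbose(boolean isLeafBlock, boolean allSuffixesSameLength) {
    if (isLeafBlock) {
      if (allSuffixesSameLength) {
        return Path.BINARY_SEARCH_LEAF;
      } else {
        return Path.SCAN_LEAF;
      }
    } else {
      return Path.SCAN_NON_LEAF;
    }
  }

  public static void main(String[] args) {
    System.out.println(chooseVerbose(true, true));
    System.out.println(chooseVerbose(true, false));
    System.out.println(chooseVerbose(false, true));
  }
}
```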



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -568,8 +573,6 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx
 
     assert prefixMatches(target);
 
-    // TODO: binary search when all terms have the same length, which is common for ID fields,

Review Comment:
   Aha!  Another `TODO` gone, thank you @vsop-479!



##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+    // 10004a matched block's entries: [11, 13, ..., 100049].
+    // if target greater than the last entry of the matched block,
+    // termsEnum.term should be the last entry.
+    assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+    assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   Hmm, when `seekExact` returns `false`, the `TermsEnum` is unpositioned and 
calling `.term()` (and other methods e.g. `.postings()`) is not allowed (the 
behavior is undefined -- it could throw an exception or corrupt its internal 
state or so) ... why are we testing that here :)






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2020383140

   I like this idea!  It seems like it'd especially help primary key lookup 
against fixed length IDs like UUID?
   
   Hmm, the QPS in the `luceneutil` runs are way too high (1000s of QPS) to be 
trustworthy?  Was this on `wikimediumall`?
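
On the fixed-length ID point above: when every entry has the same width, entry `i` starts at byte `i * width`, so the entries can be binary searched directly in the packed byte array. The following is a simplified sketch of that idea and not the PR's code; `FixedWidthSearchSketch`, `pack` and `search` are illustrative names, and the IDs are assumed to be ASCII and already sorted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FixedWidthSearchSketch {
  // Concatenates equal-width, already sorted ASCII IDs into one byte array.
  static byte[] pack(int width, String... sortedIds) {
    byte[] packed = new byte[sortedIds.length * width];
    for (int i = 0; i < sortedIds.length; i++) {
      byte[] id = sortedIds[i].getBytes(StandardCharsets.US_ASCII);
      if (id.length != width) {
        throw new IllegalArgumentException("all IDs must be " + width + " bytes");
      }
      System.arraycopy(id, 0, packed, i * width, width);
    }
    return packed;
  }

  // Returns the index of the matching entry, or -(insertionPoint) - 1 if absent.
  static int search(byte[] packed, int width, byte[] target) {
    int lo = 0;
    int hi = packed.length / width - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int from = mid * width; // entry offset follows from the fixed width
      int cmp = Arrays.compareUnsigned(packed, from, from + width, target, 0, target.length);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);
  }

  public static void main(String[] args) {
    byte[] packed = pack(4, "id01", "id03", "id05", "id07");
    System.out.println(search(packed, 4, "id05".getBytes(StandardCharsets.US_ASCII)));
    System.out.println(search(packed, 4, "id04".getBytes(StandardCharsets.US_ASCII)));
  }
}
```

With variable-length entries the offset of entry `i` is not known without reading the preceding lengths, which is why a linear scan is the natural fallback there.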





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-15 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1999256752

   @jpountz 
   I want to move `subCheckBinarySearch` to `BasePostingsFormatTestCase` to 
make this change forward compatible, by checking whether the 
`IndexWriterConfig` uses the default postings format, like this:
   
   if (TestUtil.getDefaultPostingsFormat()
       .getName()
       .equals(TestUtil.getPostingsFormat(iwc.getCodec(), "id"))) {
     // test target greater than the last entry of matched block,
   }
   
   But this fails when the default postings format does not use 
`DEFAULT_MIN_BLOCK_SIZE` and `DEFAULT_MAX_BLOCK_SIZE`, as in 
`TestPerFieldPostingsFormat`.
   
   
   I also tried setting the default codec to test the case where the target is 
greater than the last entry of the matched block, like this:
   
   iwc.setCodec(TestUtil.getDefaultCodec());
   
   `TestSTUniformSplitPostingFormat.checkEncoding` fails with this, because it 
must use its own `FieldsConsumer` to set states like `blocksEncoded`.
   
   Do you have any ideas about this? 
   Can we expose `minTermBlockSize` and `maxTermBlockSize` in 
`LuceneXXPostingsFormat` to a `DefaultPostingsFormat`, so that users can set 
them?





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-02-19 Thread via GitHub


github-actions[bot] commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1953304146

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-02-05 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1926589435

   @jpountz 
   Can we move this change forward by checking whether our test cases cover 
every status that `TermsEnum.seekExact` or `TermsEnum.seekCeil` may return?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-01-28 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1914149013

   @jpountz @mikemccand 
   I resolved the conflicts, and moved the test case for a target greater than 
the last entry of the matched block from `TestLucene90PostingsFormat` to 
`TestLucene99PostingsFormat`.
   Please take a look when you get a chance!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1907130580

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-01-08 Thread via GitHub


jpountz commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1881059520

   @mikemccand I could use your help to review this change, it's quite deep in 
the guts of block tree.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-01-08 Thread via GitHub


github-actions[bot] commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1880904269

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2023-10-12 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1759050886

   Appending some performance data. Note that the results vary considerably 
between rounds.
   
   # round1
   | Task | QPS baseline | StdDev | QPS bsearch | StdDev | Pct diff | p-value |
   | --- | --- | --- | --- | --- | --- | --- |
   | BrowseRandomLabelTaxoFacets | 2601.88 | (5.0%) | 2437.93 | (11.1%) | -6.3% (-21% - 10%) | 0.021 |
   | BrowseRandomLabelSSDVFacets | 1796.93 | (7.1%) | 1696.76 | (8.0%) | -5.6% (-19% - 10%) | 0.019 |
   | MedPhrase | 4021.51 | (6.7%) | 3813.81 | (7.6%) | -5.2% (-18% - 9%) | 0.022 |
   | HighPhrase | 821.97 | (7.3%) | 788.46 | (6.3%) | -4.1% (-16% - 10%) | 0.059 |
   | HighTerm | 6221.04 | (7.3%) | 6032.11 | (5.6%) | -3.0% (-14% - 10%) | 0.138 |
   | Fuzzy2 | 276.46 | (6.0%) | 268.79 | (4.8%) | -2.8% (-12% - 8%) | 0.105 |
   | Respell | 707.62 | (5.9%) | 692.48 | (4.9%) | -2.1% (-12% - 9%) | 0.211 |
   | BrowseDayOfYearSSDVFacets | 6517.91 | (5.4%) | 6392.55 | (6.0%) | -1.9% (-12% - 10%) | 0.287 |
   | BrowseDateSSDVFacets | 2195.69 | (19.2%) | 2155.70 | (9.7%) | -1.8% (-25% - 33%) | 0.705 |
   | BrowseMonthSSDVFacets | 6724.35 | (6.7%) | 6606.24 | (6.1%) | -1.8% (-13% - 11%) | 0.386 |
   | Prefix3 | 3023.99 | (4.5%) | 2974.60 | (5.3%) | -1.6% (-10% - 8%) | 0.293 |
   | Fuzzy1 | 886.09 | (5.1%) | 877.56 | (6.0%) | -1.0% (-11% - 10%) | 0.587 |
   | PKLookup | 396.87 | (15.8%) | 393.83 | (19.2%) | -0.8% (-30% - 40%) | 0.890 |
   | OrHighLow | 3242.80 | (6.0%) | 3223.53 | (7.2%) | -0.6% (-13% - 13%) | 0.778 |
   | BrowseMonthTaxoFacets | 5874.67 | (6.1%) | 5856.79 | (5.1%) | -0.3% (-10% - 11%) | 0.865 |
   | IntNRQ | 2799.54 | (4.7%) | 2792.69 | (5.4%) | -0.2% (-9% - 10%) | 0.878 |
   | LowIntervalsOrdered | 1336.37 | (8.1%) | 1333.20 | (5.6%) | -0.2% (-12% - 14%) | 0.914 |
   | HighSpanNear | 2660.49 | (6.7%) | 2654.45 | (6.1%) | -0.2% (-12% - 13%) | 0.911 |
   | LowTerm | 9965.56 | (8.1%) | 9961.77 | (10.6%) | -0.0% (-17% - 20%) | 0.990 |
   | AndHighHigh | 3384.43 | (7.0%) | 3388.41 | (11.3%) | 0.1% (-16% - 19%) | 0.968 |
   | HighSloppyPhrase | 1984.76 | (5.8%) | 1988.83 | (5.3%) | 0.2% (-10% - 12%) | 0.908 |
   | MedIntervalsOrdered | 7914.54 | (8.2%) | 7944.02 | (10.2%) | 0.4% (-16% - 20%) | 0.899 |
   | AndHighMed | 4097.29 | (7.7%) | 4121.75 | (8.0%) | 0.6% (-14% - 17%) | 0.811 |
   | LowSpanNear | 5107.67 | (9.3%) | 5145.19 | (7.6%) | 0.7% (-14% - 19%) | 0.785 |
   | HighTermMonthSort | 3221.73 | (5.2%) | 3245.54 | (8.5%) | 0.7% (-12% - 15%) | 0.739 |
   | HighIntervalsOrdered | 1333.81 | (7.6%) | 1349.72 | (5.3%) | 1.2% (-10% - 15%) | 0.564 |
   | LowPhrase | 5029.07 | (8.3%) | 5091.95 | (11.1%) | 1.3% (-16% - 22%) | 0.687 |
   | Wildcard | 1327.36 | (3.8%) | 1346.91 | (3.4%) | 1.5% (-5% - 9%) | 0.197 |
   | AndHighLow | 4382.38 | (7.8%) | 4447.59 | (6.9%) | 1.5% (-12% - 17%) | 0.524 |
   | OrHighMed | 3121.72 | (7.2%) | 3169.60 | (6.4%) | 1.5% (-11% - 16%) | 0.478 |
   | HighTermDayOfYearSort | 3766.72 | (6.2%) | 3825.04 | (7.2%) | 1.5% (-11% - 15%) | 0.467 |
   | MedTerm | 8666.16 | (7.3%) | 8841.37 | (8.1%) | 2.0% (-12% - 18%) | 0.406 |
   | LowSloppyPhrase | 3303.11 | (6.5%) | 3374.89 | (7.8%) | 2.2% (-11% - 17%) | 0.341 |
   | OrHighHigh | 2458.90 | (7.7%) | 2512.82 | (5.2%) | 2.2% (-9% - 16%) | 0.289 |
   | BrowseDateTaxoFacets | 6229.11 | (5.5%) | 6366.43 | (5.7%) | 2.2% (-8% - 14%) | 0.211 |
   | BrowseDayOfYearTaxoFacets | 5695.81 | (6.8%) | 5830.89 | (6.6%) | 2.4% (-10% - 16%) | 0.265 |
   | MedSpanNear | 3161.49 | (6.8%) | 3242.45 | (5.4%) | 2.6% (-8% - 15%) | 0.186 |
   | MedSloppyPhrase | 3363.11 | (7.0%) | 3456.66 | (7.6%) | 2.8% (-11% - 18%) | 0.230 |
   
   # round2
   
   Task QPS baseline   StdDev   QPS bsearch 
   StdDev  Pct diff