[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-31 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1092337143


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -56,6 +56,8 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes 
limit
   private boolean incomplete;
 
+  public static final NeighborQueue EMPTY_MAX_HEAP_NEIGHBOR_QUEUE = new 
NeighborQueue(1, true);

Review Comment:
   You are right, I did not think about this. Given how much mutable state 
there is, I am wondering if it might just be better to get rid of this. What do 
you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-31 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1092275814


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -489,6 +485,220 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private HnswGraphBuilder createFloatVectorHnswGraphBuilder(

Review Comment:
   Oh I see. Makes sense. I updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-30 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1091341226


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -489,6 +485,220 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private HnswGraphBuilder createFloatVectorHnswGraphBuilder(

Review Comment:
   I cant think of a good way to do this. HnswGraphBuilder  already uses 
generics and requires that the same generic be passed in for the 
RandomVectorValues. So I think some branching logic will be required no matter 
what. Did you have an idea for this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-30 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1091332482


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java:
##
@@ -182,10 +201,43 @@ public int nextInt() {
 public boolean hasNext() {
   return cur < size;
 }
+  }
 
-/** The number of elements in this iterator * */
-public int size() {
-  return size;
+  /** Nodes iterator based on set representation of nodes. */
+  public static class SetNodesIterator extends NodesIterator {

Review Comment:
   Actually, it is required to have a "size()" method, so a better name might 
be CollectionNodesIterator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-30 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1091331389


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java:
##
@@ -182,10 +201,43 @@ public int nextInt() {
 public boolean hasNext() {
   return cur < size;
 }
+  }
 
-/** The number of elements in this iterator * */
-public int size() {
-  return size;
+  /** Nodes iterator based on set representation of nodes. */
+  public static class SetNodesIterator extends NodesIterator {

Review Comment:
   Yes, this is a good point. What about WrappedNodesIterator?



##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -33,44 +32,36 @@
 public final class OnHeapHnswGraph extends HnswGraph implements Accountable {
 
   private int numLevels; // the current number of levels in the graph
-  private int entryNode; // the current graph entry node on the top level
+  private int entryNode; // the current graph entry node on the top level. -1 
if not set
 
-  // Nodes by level expressed as the level 0's nodes' ordinals.
-  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
-  private final List nodesByLevel;
-
-  // graph is a list of graph levels.
-  // Each level is represented as List – nodes' connections on 
this level.
+  // Level 0 is represented as List – nodes' connections on 
level 0.
   // Each entry in the list has the top maxConn/maxConn0 neighbors of a node. 
The nodes correspond
   // to vectors
   // added to HnswBuilder, and the node values are the ordinals of those 
vectors.
   // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
-  private final List> graph;
+  private final List graphLevel0;
+  // Represents levels 1-N. Each level is represented with a TreeMap that maps 
a levels level 0
+  // ordinal to its
+  // neighbors on that level.
+  private final List> graphUpperLevels;

Review Comment:
   Will add a comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-30 Thread via GitHub


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1091106407


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues 
vectorsToAdd) throws IOExc
 return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their 
neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping 
ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The 
builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the 
initializerGraph to this
+   * builder's graph
+   */
+  public void initializeFromGraph(
+  HnswGraph initializerGraph, Map oldToNewOrdinalMap) 
throws IOException {
+assert hnsw.size() == 0;

Review Comment:
   Makes sense. Will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-19 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1081896861


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Updated structure to use treemap to represent upper levels of graph.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-18 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1080646383


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Added a commit for it here: 
https://github.com/jmazanec15/lucene/commit/9c54de56fa37a35bdff241abd9ebe3a6f1d8ba3a.
 Running some performance tests to compare results.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-13 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1069901702


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Oh I see what you mean. Yes, that makes sense. 
   
   I think for level 0, we will still want to use a List 
because all nodes will eventually be present in this level. However, for levels 
> 0, we can use a TreeMap and then add an iterator over the keys of that map.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-12 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1068562367


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Thinking about this more, one issue with a TreeSet based approach would be 
that we wouldnt be able to look up the index of a particular element (unless we 
converted the set to a list each time we need to do lookup). We need this index 
to query the OnHeapHnswGraph.graph to get the NeighborArray for a particular 
element 
([ref](https://github.com/jmazanec15/lucene/blob/hnsw-merge-from-graph/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L87)).
 I am not sure if there is a good way around this - are you aware of any?
   
   Alternatively, I was thinking if we wanted to avoid the expensive copy for 
inserting when position < idx, we could switch from using an int[] to an 
ArrayList to represent nodesByLevel for a particular level. This 
should avoid in most cases the extreme copy that out of order insertion would 
require.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-11 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1067742826


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   That makes sense. This shouldnt be a big problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-05 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1062966074


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   > but still because in L156 we need to copy the rest of array again and 
again as long as that is a non-appending action
   
   Right, this could be expensive for out of order insertion. I can try 
switching the nodeByLevel int array to a TreeSet and compare performance to 
https://github.com/apache/lucene/issues/11354.
   
   One complication with this approach is that the NodesIterator expects an int 
array: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java#L134.
 Given this is a public interface, we might need to either convert the treeset 
to an int array every time 
[getNodesOnLevel](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L165)
 gets called, or alter the NodesIterator interface to support both an int array 
and an Iterator produced from the TreeSet.
   
   @zhaih What do you think of this approach? Is there better way to do this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061991840


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues 
vectorsToAdd) throws IOExc
 return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their 
neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping 
ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The 
builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the 
initializerGraph to this
+   * builder's graph
+   */
+  public void initializeFromGraph(
+  HnswGraph initializerGraph, Map oldToNewOrdinalMap) 
throws IOException {
+assert hnsw.size() == 0;
+float[] vectorValue = null;
+BytesRef binaryValue = null;
+for (int level = 0; level < initializerGraph.numLevels(); level++) {
+  HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+  while (it.hasNext()) {
+int oldOrd = it.nextInt();
+int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+hnsw.addNode(level, newOrd);
+
+if (level == 0) {
+  initializedNodes.add(newOrd);
+}
+
+switch (this.vectorEncoding) {
+  case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+  case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+}
+
+NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+initializerGraph.seek(level, oldOrd);
+for (int oldNeighbor = initializerGraph.nextNeighbor();
+oldNeighbor != NO_MORE_DOCS;
+oldNeighbor = initializerGraph.nextNeighbor()) {
+  int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+  float score =
+  switch (this.vectorEncoding) {
+case FLOAT32 -> this.similarityFunction.compare(

Review Comment:
   > Is this sorted order only used for calculating diversity easier?
   
   Yes, I think you are correct. The  reason for sorting order by distance 
during construction is that the neighbor arrays of an inserted node continue to 
get updated as more nodes are inserted in. So keeping it sorted will allow the 
worst node or nodes will allow it to be more easily identified. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061916872


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   Im not sure on this. For random insertion for the graph, I think a BST would 
be better. 
   
   However, the insertion pattern for merge typically wont be random. It will 
be more like first, nodes [15-73] are inserted, and then nodes [0-14] and then 
nodes [74-100]. This assumes that the MergedVectorValues are a concatenation of 
the segments to be merged vectors. For this, I added the small optimization of 
the "lastAddedPosInLayer" list, which will skip binary search during locally 
ordered insertion.
   
   That being said, I am not sure that the "concatenation" property is 
guaranteed, specifically in the case when the mergeState.needsIndexSort == 
true. Given this case, it seems like insertion pattern might be more random.
   
   All that being said, do you think it would be better to build for the 
concatenation pattern or more random insert pattern?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061905607


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -143,10 +148,64 @@ public OnHeapHnswGraph build(RandomAccessVectorValues 
vectorsToAdd) throws IOExc
 return hnsw;
   }
 
+  /**
+   * Initializes the graph of this builder. Transfers the nodes and their 
neighbors from the
+   * initializer graph into the graph being produced by this builder, mapping 
ordinals from the
+   * initializer graph to their new ordinals in this builder's graph. The 
builder's graph must be
+   * empty before calling this method.
+   *
+   * @param initializerGraph graph used for initialization
+   * @param oldToNewOrdinalMap map for converting from ordinals in the 
initializerGraph to this
+   * builder's graph
+   */
+  public void initializeFromGraph(
+  HnswGraph initializerGraph, Map oldToNewOrdinalMap) 
throws IOException {
+assert hnsw.size() == 0;
+float[] vectorValue = null;
+BytesRef binaryValue = null;
+for (int level = 0; level < initializerGraph.numLevels(); level++) {
+  HnswGraph.NodesIterator it = initializerGraph.getNodesOnLevel(level);
+
+  while (it.hasNext()) {
+int oldOrd = it.nextInt();
+int newOrd = oldToNewOrdinalMap.get(oldOrd);
+
+hnsw.addNode(level, newOrd);
+
+if (level == 0) {
+  initializedNodes.add(newOrd);
+}
+
+switch (this.vectorEncoding) {
+  case FLOAT32 -> vectorValue = vectors.vectorValue(newOrd);
+  case BYTE -> binaryValue = vectors.binaryValue(newOrd);
+}
+
+NeighborArray newNeighbors = this.hnsw.getNeighbors(level, newOrd);
+initializerGraph.seek(level, oldOrd);
+for (int oldNeighbor = initializerGraph.nextNeighbor();
+oldNeighbor != NO_MORE_DOCS;
+oldNeighbor = initializerGraph.nextNeighbor()) {
+  int newNeighbor = oldToNewOrdinalMap.get(oldNeighbor);
+  float score =
+  switch (this.vectorEncoding) {
+case FLOAT32 -> this.similarityFunction.compare(

Review Comment:
   Ah I thought about this, but the HnswGraph does not guarantee ordering of 
neighbors. For example, during writing, the neighbors get sorted to improve 
compression: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java#L486-L493.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902971


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+HnswGraph initializerGraph =
+getHnswGraphFromReader(fieldInfo.name, 
mergeState.knnVectorsReaders[initializerIndex]);
+Map ordinalMapper =
+getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+
+throw new IllegalArgumentException(
+"Invalid KnnVectorsReader. Must be of type 
PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");
+  }
+
+  private Map getOldToNewOrdinalMap(
+  MergeState mergeState, FieldInfo fieldInfo, int initializerIndex) throws 
IOException {
+VectorValues initializerVectorValues =
+
mergeState.knnVectorsReaders[initializerIndex].getVectorValues(fieldInfo.name);
+MergeState.DocMap initializerDocMap = mergeState.docMaps[initializerIndex];
+
+Map newIdToOldOrdinal = new HashMap<>();
+int oldOrd = 0;
+for (int oldId = initializerVectorValues.nextDoc();
+oldId != NO_MORE_DOCS;
+oldId = initializerVectorValues.nextDoc()) {
+  if (initializerVectorValues.vectorValue() == null) {
+continue;
+  }
+  int newId = initializerDocMap.get(oldId);
+  newIdToOldOrdinal.put(newId, oldOrd);
+  oldOrd++;
+}
+
+Map oldToNewOrdinalMap = new HashMap<>();
+int newOrd = 0;
+int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());

Review Comment:
   Good idea, I will update this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902422


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+HnswGraph initializerGraph =
+getHnswGraphFromReader(fieldInfo.name, 
mergeState.knnVectorsReaders[initializerIndex]);
+Map ordinalMapper =
+getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+

Review Comment:
   Good idea, I will add this comment here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-04 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1061902120


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -461,6 +467,126 @@ public void mergeOneField(FieldInfo fieldInfo, MergeState 
mergeState) throws IOE
 }
   }
 
+  private void maybeInitializeFromGraph(
+  HnswGraphBuilder hnswGraphBuilder, MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+int initializerIndex = selectGraphForInitialization(mergeState, fieldInfo);
+if (initializerIndex == -1) {
+  return;
+}
+
+HnswGraph initializerGraph =
+getHnswGraphFromReader(fieldInfo.name, 
mergeState.knnVectorsReaders[initializerIndex]);
+Map ordinalMapper =
+getOldToNewOrdinalMap(mergeState, fieldInfo, initializerIndex);
+hnswGraphBuilder.initializeFromGraph(initializerGraph, ordinalMapper);
+  }
+
+  private int selectGraphForInitialization(MergeState mergeState, FieldInfo 
fieldInfo)
+  throws IOException {
+// Find the KnnVectorReader with the most docs that meets the following 
criteria:
+//  1. Does not contain any deleted docs
+//  2. Is a Lucene95HnswVectorsReader/PerFieldKnnVectorReader
+// If no readers exist that meet this criteria, return -1. If they do, 
return their index in
+// merge state
+int maxCandidateVectorCount = 0;
+int initializerIndex = -1;
+
+for (int i = 0; i < mergeState.liveDocs.length; i++) {
+  KnnVectorsReader currKnnVectorsReader = mergeState.knnVectorsReaders[i];
+  if (mergeState.knnVectorsReaders[i]
+  instanceof PerFieldKnnVectorsFormat.FieldsReader candidateReader) {
+currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+  }
+
+  if (!allMatch(mergeState.liveDocs[i])
+  || !(currKnnVectorsReader instanceof Lucene95HnswVectorsReader 
candidateReader)) {
+continue;
+  }
+
+  VectorValues vectorValues = 
candidateReader.getVectorValues(fieldInfo.name);
+  if (vectorValues == null) {
+continue;
+  }
+
+  int candidateVectorCount = vectorValues.size();
+  if (candidateVectorCount > maxCandidateVectorCount) {
+maxCandidateVectorCount = candidateVectorCount;
+initializerIndex = i;
+  }
+}
+return initializerIndex;
+  }
+
+  private HnswGraph getHnswGraphFromReader(String fieldName, KnnVectorsReader 
knnVectorsReader)
+  throws IOException {
+if (knnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader 
perFieldReader
+&& perFieldReader.getFieldReader(fieldName)
+instanceof Lucene95HnswVectorsReader fieldReader) {
+  return fieldReader.getGraph(fieldName);
+}
+
+if (knnVectorsReader instanceof Lucene95HnswVectorsReader) {
+  return ((Lucene95HnswVectorsReader) 
knnVectorsReader).getGraph(fieldName);
+}
+
+throw new IllegalArgumentException(
+"Invalid KnnVectorsReader. Must be of type 
PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");

Review Comment:
   Makes sense. Will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org