Re: Disconnectedness in HNSW graphs in Lucene
Nitiraj, Good experimentation! Connectedness within layers is indeed important. The algorithm itself should ensure connectedness of disjoint NSWs as it mutually connects nodes (selected over diversity). However, if the data is extremely clustered, this can cause connectedness to drop (few densely packed clusters may not connect to other densely packed clusters). For your extreme examples, is the data densely clustered? What would you suggest as an improvement in Lucene regarding the algorithm implementation? An interesting experiment would be to see if `hnswlib` has the same connected issues if it indexes the same vectors in the same order. Thanks! Ben Trent On Wed, Aug 23, 2023 at 5:07 AM Nitiraj Singh Rathore < nitiraj.rath...@gmail.com> wrote: > Hi Lucene developers, > > I work for Amazon Retail Product search and we are using Lucene KNN for > semantic search of products. We index product embeddings (vectors) into > lucene (hnsw graph) and search them by generating query embedding at > runtime. The product embeddings also receive regular updates and the index > geometry keeps changing because of merges. > We recently noticed that the hnsw graphs generated are not always strongly > connected and in worst case scenario some products may be undiscoverable. > Connectedness of Hierarchical graph can be complicated, so below I am > mentioning my experiment details. > > - Experiment: > I took the Lucene indexes from our production servers and for each segment > (hnsw graph) I did following test. > At each level graph I took the same entry point, the entry point of HNSW > graph, checked how many nodes are reachable from this entrypoint. Note that > connectedness at each level was checked independently of other levels. > Sample code attached. My observations are as below. > > - Observation : > 1. Of all the graphs across all the segments, across 100s of indexes > that I considered, one graph for each "level" of HNSW, almost 18% of the > graphs had some disconnectedness. > 2. Disconnectedness was observed at all the levels of HNSW graphs. We have > at most 3 levels in HNSW graphs. > 3. percentage disconnectedness ranged from small fractions 0.000386% (1 > disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308). > In some extreme case the entry-point in zeroth level graph was > disconnected from rest of the graph making the %age disconnected as high as > 99.9% > (65 reachable nodes from EP out of 252275). But this does not necessarily > mean that the 99.9% of nodes were not discoverable, it just means that if > unluckily we end up on EP in the 0th level graph for a query, there can at > max be 65 nodes that can be reached. But had we diverted our path from EP > to some other node in the upper level graphs then may be more nodes be > discoverable via that node. > > - What I Not Checked : > What I have not checked till now is the connectedness for the whole HNSW > graph including edges of all the levels. > Also, I have not checked the number of disconnected components in a graph. > I have just checked the number of connected nodes to the entry-point. > > But irrespective of that, I think graphs should be strongly connected at > each level and disconnectedness if at all should be very very rare. > > Thanks Kaival Parikh for discovering the issue in the first place and the > script for checking connectedness. > > What do others think about this? > > public class CheckHNSWConnectedness { > private static int getReachableNodes(HnswGraph graph, int level) throws > IOException { > Set visited = new HashSet<>(); > Stack candidates = new Stack<>(); > candidates.push(graph.entryNode()); > > while (!candidates.isEmpty()) { > int node = candidates.pop(); > > if (visited.contains(node)) { > continue; > } > > visited.add(node); > graph.seek(level, node); > > int friendOrd; > while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) { > candidates.push(friendOrd); > } > } > return visited.size(); > } > > public static void checkConnected(String index, String hnswField) throws > IOException, NoSuchFieldException, IllegalAccessException { > try (FSDirectory dir = FSDirectory.open(Paths.get(index)); > IndexReader indexReader = DirectoryReader.open(dir)) { > for (LeafReaderContext ctx : indexReader.leaves() ) { > KnnVectorsReader reader = > ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader) > ctx.reader()).getVectorReader()).getFieldReader(hnswField); > > if (reader != null) { > HnswGraph graph = ((Lucene95HnswVectorsReader) > reader).getGraph(hnswField); > for (int l = 0; l < graph.numLevels(); l++){ > int reachableNodes = getReachableNodes(graph, l); > //
Re: 8.11.3 release
Hi Jan, Yes, still targeting September. But I will slip on my initial plan of doing it by first week of September. I'm foreseeing mid September timeframe. Thanks for checking in. Regards, Ishan On Wed, 23 Aug, 2023, 5:05 pm Jan Høydahl, wrote: > Hi, > > Following up on Ishan's proposed 8.11.3 release ( > https://lists.apache.org/thread/3xjtv1sxqx8f9nvhkc0cb90b2p76nfx2) > > Does the Lucene project have any bugfix candidates for backporting? > > Ishan, are you still targeting September? > > Jan > > > > 1. aug. 2023 kl. 14:57 skrev Ishan Chattopadhyaya < > ichattopadhy...@gmail.com>: > > > > Oh yes, good idea. Forgot about the split! > > > > +Lucene Dev > > > > On Tue, 1 Aug, 2023, 6:17 pm Uwe Schindler, wrote: > > > >> Maybe ask on Lucene list, too, if there are some bug people like to have > >> fixed in Lucene. > >> > >> Uwe > >> > >> Am 01.08.2023 um 11:10 schrieb Ishan Chattopadhyaya: > >>> Hi all, > >>> There have been lots of bug fixes that have gone into 9x that should > >>> benefit all 8x users as well. > >>> I thought of volunteering for such a 8.x release based on this comment > >> [0]. > >>> > >>> Unless someone has any objections or concerns, can we tentatively plan > >> 1st > >>> September 2023 (1 month from now) as a tentative release for 8.11.3? I > >>> think we will get ample time to backport relevant fixes to 8x by then. > >>> > >>> Best regards, > >>> Ishan > >>> > >>> [0] - > >>> > >> > https://issues.apache.org/jira/browse/SOLR-16777?focusedCommentId=17742854=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17742854 > >>> > >> -- > >> Uwe Schindler > >> Achterdiek 19, D-28357 Bremen > >> https://www.thetaphi.de > >> eMail: u...@thetaphi.de > >> > >> > >> - > >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > >> For additional commands, e-mail: dev-h...@solr.apache.org > >> > >> > > > - > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org > For additional commands, e-mail: dev-h...@solr.apache.org > >
Re: 8.11.3 release
Hi, Following up on Ishan's proposed 8.11.3 release (https://lists.apache.org/thread/3xjtv1sxqx8f9nvhkc0cb90b2p76nfx2) Does the Lucene project have any bugfix candidates for backporting? Ishan, are you still targeting September? Jan > 1. aug. 2023 kl. 14:57 skrev Ishan Chattopadhyaya : > > Oh yes, good idea. Forgot about the split! > > +Lucene Dev > > On Tue, 1 Aug, 2023, 6:17 pm Uwe Schindler, wrote: > >> Maybe ask on Lucene list, too, if there are some bug people like to have >> fixed in Lucene. >> >> Uwe >> >> Am 01.08.2023 um 11:10 schrieb Ishan Chattopadhyaya: >>> Hi all, >>> There have been lots of bug fixes that have gone into 9x that should >>> benefit all 8x users as well. >>> I thought of volunteering for such a 8.x release based on this comment >> [0]. >>> >>> Unless someone has any objections or concerns, can we tentatively plan >> 1st >>> September 2023 (1 month from now) as a tentative release for 8.11.3? I >>> think we will get ample time to backport relevant fixes to 8x by then. >>> >>> Best regards, >>> Ishan >>> >>> [0] - >>> >> https://issues.apache.org/jira/browse/SOLR-16777?focusedCommentId=17742854=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17742854 >>> >> -- >> Uwe Schindler >> Achterdiek 19, D-28357 Bremen >> https://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org >> For additional commands, e-mail: dev-h...@solr.apache.org >> >> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Disconnectedness in HNSW graphs in Lucene
Hi Lucene developers, I work for Amazon Retail Product search and we are using Lucene KNN for semantic search of products. We index product embeddings (vectors) into lucene (hnsw graph) and search them by generating query embedding at runtime. The product embeddings also receive regular updates and the index geometry keeps changing because of merges. We recently noticed that the hnsw graphs generated are not always strongly connected and in worst case scenario some products may be undiscoverable. Connectedness of Hierarchical graph can be complicated, so below I am mentioning my experiment details. - Experiment: I took the Lucene indexes from our production servers and for each segment (hnsw graph) I did following test. At each level graph I took the same entry point, the entry point of HNSW graph, checked how many nodes are reachable from this entrypoint. Note that connectedness at each level was checked independently of other levels. Sample code attached. My observations are as below. - Observation : 1. Of all the graphs across all the segments, across 100s of indexes that I considered, one graph for each "level" of HNSW, almost 18% of the graphs had some disconnectedness. 2. Disconnectedness was observed at all the levels of HNSW graphs. We have at most 3 levels in HNSW graphs. 3. percentage disconnectedness ranged from small fractions 0.000386% (1 disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308). In some extreme case the entry-point in zeroth level graph was disconnected from rest of the graph making the %age disconnected as high as 99.9% (65 reachable nodes from EP out of 252275). But this does not necessarily mean that the 99.9% of nodes were not discoverable, it just means that if unluckily we end up on EP in the 0th level graph for a query, there can at max be 65 nodes that can be reached. But had we diverted our path from EP to some other node in the upper level graphs then may be more nodes be discoverable via that node. - What I Not Checked : What I have not checked till now is the connectedness for the whole HNSW graph including edges of all the levels. Also, I have not checked the number of disconnected components in a graph. I have just checked the number of connected nodes to the entry-point. But irrespective of that, I think graphs should be strongly connected at each level and disconnectedness if at all should be very very rare. Thanks Kaival Parikh for discovering the issue in the first place and the script for checking connectedness. What do others think about this? public class CheckHNSWConnectedness { private static int getReachableNodes(HnswGraph graph, int level) throws IOException { Set visited = new HashSet<>(); Stack candidates = new Stack<>(); candidates.push(graph.entryNode()); while (!candidates.isEmpty()) { int node = candidates.pop(); if (visited.contains(node)) { continue; } visited.add(node); graph.seek(level, node); int friendOrd; while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) { candidates.push(friendOrd); } } return visited.size(); } public static void checkConnected(String index, String hnswField) throws IOException, NoSuchFieldException, IllegalAccessException { try (FSDirectory dir = FSDirectory.open(Paths.get(index)); IndexReader indexReader = DirectoryReader.open(dir)) { for (LeafReaderContext ctx : indexReader.leaves() ) { KnnVectorsReader reader = ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader) ctx.reader()).getVectorReader()).getFieldReader(hnswField); if (reader != null) { HnswGraph graph = ((Lucene95HnswVectorsReader) reader).getGraph(hnswField); for (int l = 0; l < graph.numLevels(); l++){ int reachableNodes = getReachableNodes(graph, l); // int graphSize = graph.size(); // this gives nodes at 0th level int graphSize = graph.getNodesOnLevel(l).size(); System.out.printf("Level: %d, Actual Size: %d, Reachable: %d, Not Reachable : %d\n", l, graphSize, reachableNodes, (graphSize - reachableNodes)); } } } } } public static void main(String[] args) throws IOException, NoSuchFieldException, IllegalAccessException { String index = args[0]; String field = args[1]; System.out.println("For index " + index + " field : " + field); checkConnected(index, field); } -- Regards, Nitiraj