Hi Spark users,

A few years back I created a Java implementation of the HNSW algorithm in my spare time. HNSW is an algorithm for k-nearest neighbour search, or, as people tend to refer to it now, vector search.
It can be used to implement things like recommendation systems, image search, and retrieval augmented generation (RAG).

At the time, one of the data scientists on the team I was working in asked me to create a distributed version of it on top of PySpark. I did, and it eventually found its way into production codebases, powering our recommendation systems. But the integration never felt elegant or natural: by its nature the HNSW algorithm uses many threads, which does not fit Spark's one-core-per-task model well.

You could configure the library to use many threads without updating your Spark configuration. If your cluster did not use cgroups this would work, but you'd be a burden to every other job running on the cluster.

Alternatively, you could set spark.executor.cores to the number of threads your index uses. But that meant all the other tasks in the same job would also claim that many cores, even if they did not require them. This made it particularly inconvenient to use from notebooks.

Version 2 is a large overhaul of the library that takes advantage of stage-level scheduling and does away with these issues. A lot could still be done to improve performance, especially compared to native implementations, but I think it should be fast enough for many use cases, and the ease of use on Spark makes up for some of its shortcomings.

I'd love for you to try it out and share your feedback, both positive and negative. Contributions, bug reports, and feature requests are always welcome on GitHub. The documentation can be found here: https://jelmerk.github.io/hnswlib-spark/
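For those who haven't come across it, stage-level scheduling is standard Spark functionality (available since Spark 3.1.1, and it requires dynamic allocation on YARN/Kubernetes). It lets individual stages request their own resources instead of inheriting the job-wide settings. As a rough illustration of the mechanism (not this library's actual code; the resource numbers are made up and search_partition is a hypothetical per-partition function), a minimal PySpark sketch looks like this:

    from pyspark.resource import (
        ExecutorResourceRequests,
        ResourceProfileBuilder,
        TaskResourceRequests,
    )

    # Request executors with 8 cores, and declare that a single task
    # needs all 8 of them. Spark then schedules one task per executor,
    # so that task can fan out across 8 threads without oversubscribing
    # cores shared with other tasks.
    executor_reqs = ExecutorResourceRequests().cores(8).memory("8g")
    task_reqs = TaskResourceRequests().cpus(8)
    profile = (
        ResourceProfileBuilder()
        .require(executor_reqs)
        .require(task_reqs)
        .build
    )

    # Only stages computed from this RDD run under the profile;
    # search_partition is a stand-in for the multi-threaded work.
    result = df.rdd.withResources(profile).mapPartitions(search_partition)

The rest of the job, including anything else you run from the same notebook, stays on the default one-core-per-task profile, which is exactly what made the old workarounds unnecessary.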