Hi Spark users,

A few years back I created a Java implementation of the HNSW algorithm in
my spare time. HNSW is an algorithm for k-nearest neighbour search, or as
people tend to refer to it now: vector search.

It can be used to implement things like recommendation systems, image
search, and retrieval-augmented generation (RAG).

At the time one of the data scientists on the team I was working in asked
me to create a distributed version of it on top of PySpark. I did, and it
eventually even found its way into production codebases, powering our
recommendation systems.

But the integration never felt elegant or natural. By its nature the HNSW
algorithm uses many threads, which does not fit Spark's one-core-per-task
model well.

You could configure the library to use many threads without updating your
Spark configuration. If your cluster did not use cgroups this would work,
but you'd be a burden to every other job running on the cluster.

Alternatively, you could set spark.executor.cores to the number of threads
your index uses. But that meant every other task running in the same job
would also claim that many cores, even when it did not need them. This made
it particularly inconvenient to use from notebooks.

Version 2 is a large overhaul of the library that takes advantage of
stage-level scheduling and does away with these issues.
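For those unfamiliar with it, stage-level scheduling (available since Spark
3.1) lets you attach a resource profile to a single stage instead of the
whole job. The library wires this up for you, but a minimal PySpark sketch
of the underlying Spark mechanism looks roughly like this; some_rdd and
build_index are hypothetical placeholders, and 8 is an illustrative core
count:

    from pyspark.resource import (
        ExecutorResourceRequests,
        ResourceProfileBuilder,
        TaskResourceRequests,
    )

    # Ask for 8 cores per task, but only for the stages that use this
    # profile; every other stage keeps the default one-core-per-task layout.
    executor_reqs = ExecutorResourceRequests().cores(8)
    task_reqs = TaskResourceRequests().cpus(8)
    profile = (
        ResourceProfileBuilder()
        .require(executor_reqs)
        .require(task_reqs)
        .build  # build is a property, not a method
    )

    # build_index is a hypothetical per-partition function that would run
    # the multi-threaded work on its 8 reserved cores.
    result = some_rdd.withResources(profile).mapPartitions(build_index)

Depending on your cluster manager and Spark version, stage-level scheduling
may also require dynamic allocation to be enabled.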

A lot could still be done to improve the performance, especially compared
to native implementations. But I think it should be fast enough for many
use cases, and the ease of use on Spark makes up for some of its
shortcomings.

I’d love for you to try it out and share your feedback, both positive and
negative. Contributions, bug reports, and feature requests are always
welcome on GitHub.

The documentation can be found here:

https://jelmerk.github.io/hnswlib-spark/
