Hi Derek,

I have been trying a few settings with HNSW in Lucene/SOLR, and whilst my
experiences may not be directly relevant to you, they may provide some
background.

My tests have been with an index of up to 160M records, each containing a 512-element
byte embedding.  The original embeddings were of text articles (average length
about 450 words) generated by OpenAI's ada-002 as 1536 floats, then encoded to
512 bytes by mapping each group of 3 floats to 1 byte with product quantisation
(PQ), using the method described here:
https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
The motivation for PQ encoding is basically to reduce index size.  A first
attempt at encoding each float as one byte worked well (I tried to minimise
error by analysing the distribution of float values across the 1536 dimensions,
and noticed that all but 5 of the dimensions had a very narrow range for most
embeddings; using k-means clustering to find 256 values for those dimensions,
and another 256 values for the 5 "outlier" dimensions, yielded good results).
However, each vector still occupied 1536 bytes, and HNSW really needs the
vectors to be in memory (about 245GB at 1536 bytes per vector versus about 80GB
at 512 bytes, for 160M records), as otherwise the I/Os, even to RAID 10 NVMe
devices connected to their own PCIe 3 lanes, cause slow query rates.  So
quantising 3 floats into 1 byte was attractive.  Again, I used k-means, this
time on each of the 512 groups of 3 floats, to find 256 "centroids" per group
and minimise error.  The downside of this approach is the need to define a
custom similarity that reads, at initialisation, the 512 centroid tables (each
with 256 mappings that expand a byte code into 3 floating-point numbers
representing a "centroid" point).
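
To make the decode step concrete, the per-query logic is roughly along the
lines of the sketch below.  This is a simplified illustration rather than my
actual code: the class and method names, the table loading, and the plain
dot product are placeholders, and hooking something like this into Lucene
needs custom codec/similarity plumbing that I've left out.

    // Simplified sketch only -- not production code.  CENTROIDS, the loader,
    // and the use of a plain dot product are placeholders/assumptions.
    public final class PqSimilaritySketch {

        // 512 groups x 256 codes x 3 floats, loaded once from the k-means output.
        private static final float[][][] CENTROIDS = loadCentroidTables();

        /**
         * Similarity between a full 1536-float query vector and a 512-byte
         * PQ-encoded document vector: each byte selects one of 256 centroids
         * (3 floats) for its group, and each group contributes to a dot
         * product against the corresponding slice of the query.
         */
        static float score(float[] query, byte[] code) {
            float sum = 0f;
            for (int group = 0; group < code.length; group++) {
                float[] centroid = CENTROIDS[group][code[group] & 0xFF]; // 3 floats
                int base = group * 3;
                sum += query[base]     * centroid[0]
                     + query[base + 1] * centroid[1]
                     + query[base + 2] * centroid[2];
            }
            return sum;
        }

        private static float[][][] loadCentroidTables() {
            // Placeholder: read the 512 x 256 x 3 table produced by k-means training.
            throw new UnsupportedOperationException("supply your own centroid tables");
        }
    }

The centroid tables themselves are tiny (512 x 256 x 3 floats is about 1.5MB),
so keeping them in memory costs nothing compared to the vectors.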

Anyway, the loss caused by this mapping is real but not particularly
consequential: some result lists are slightly degraded/reordered, but HNSW
is an "approximate nearest neighbour" search anyway.

How sure are you that the unexpected search results you are reporting are
caused by the HNSW ANN rather than by the encoding?  For example, if you run
an exhaustive search over your 2M records to find the "real" nearest
neighbours of a point representing some base document, how do the results
differ from your HNSW search at various search beamwidths (provided as the
"k" parameter on the KnnByteVectorQuery constructor)?
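
If it helps, the kind of check I mean is sketched below.  The field name,
index path, and the assumption of byte vectors scored with a dot product are
all placeholders for whatever your schema actually uses: brute-force the
"real" top hit over the raw vectors you already have, then see how often the
HNSW query agrees as you vary k.

    // Sketch only: compares an exhaustive ("real") nearest neighbour with the
    // HNSW result from KnnByteVectorQuery.  Field/path names are made up.
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.KnnByteVectorQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class RecallCheckSketch {

        /** Exhaustive scan: index (into allVectors) of the vector closest to the query. */
        static int exactNearest(byte[] query, byte[][] allVectors) {
            int best = -1;
            float bestScore = Float.NEGATIVE_INFINITY;
            for (int i = 0; i < allVectors.length; i++) {
                float score = dot(query, allVectors[i]);
                if (score > bestScore) { bestScore = score; best = i; }
            }
            return best;
        }

        static float dot(byte[] a, byte[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
            return sum;
        }

        public static void main(String[] args) throws Exception {
            byte[][] allVectors = loadRawVectors();   // however you keep the source vectors
            byte[] query = allVectors[0];             // e.g. the vector of some base document
            int exact = exactNearest(query, allVectors);

            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (int k : new int[] {1, 5, 10, 50}) {
                    TopDocs hits = searcher.search(new KnnByteVectorQuery("embedding", query, k), k);
                    // Map hits.scoreDocs back to your own ids and compare the top hit
                    // against 'exact' to see how much loss comes from HNSW alone.
                    System.out.println("k=" + k + " exact=" + exact
                        + " hnswTopDoc=" + hits.scoreDocs[0].doc);
                }
            }
        }

        private static byte[][] loadRawVectors() {
            // Placeholder: load your source vectors from wherever they live.
            throw new UnsupportedOperationException("supply your own vectors");
        }
    }

If the exhaustive results already look "funny", the problem lies in the
embeddings or the encoding; if only the HNSW results do, it's worth looking at
the graph parameters and at k.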

Although not directly relevant to your use-case, here are the results I'm
seeing on an index of 160M documents with an ada-002 embedding quantised to
512 bytes, using a recent (11 Feb 23) Lucene build with an "M" of 64, a
construction "beamwidth" of 120, and the custom similarity described above:

with a search "k" of 1, the "real" closest match is returned 56% of the
time and requires 18K similarity comparisons
with a search "k" of 2, the "real" closest match is returned as the top
match 61% of the time and requires 22K comparisons
with a "k" of 3: 64%, 24K comparisons
with a "k" of 5: 70%, 29K
with a "k" of 10: 78%, 37K
with a "k" of 20: 87%, 48K
with a "k" of 50: 94%, 63K
with a "k" of 120: 97%, 121K

The nature of the embeddings I loaded is that many are very similar
(basically, random-ish variations on a much smaller set of "base" articles, as
we couldn't afford to get embeddings for 160M articles for this test - we are
just trying to establish whether Lucene's HNSW is feasible for our use-case).
So in the overwhelming majority of "misses", the top article returned is still
very similar to the article sought.  That is, for our use case, the results
are satisfactory, even with the "down-scaling" of the embedding to 512 bytes.

best regards

Kent Fitch



On Mon, Feb 27, 2023 at 5:02 AM Derek C <de...@hssl.ie> wrote:

> Hi all,
>
> I'm a bit uncertain how KNN with HNSW works in SOLR with dense vector
> fields and searching.
>
> Recently I've been doing tests loading dense vectors after inferencing
> [images] and then checking by eye the closest matches and the results look
> funny (very similar images not being the nearest results as I'd normally
> expect).
>
> I'm unclear about HNSW in general (like what are the best policies, or a
> good guide or starting point, for choosing hnswMaxConnections and
> hnswBeamWidth values if you know the dense vector length (512) and you know
> you have 2 million+ documents).
>
> But one thing I'm wondering right now is with a dataset over time, where
> documents have been added and documents have been removed over time, can
> this affect the KNN search (i.e. is it better if all documents, or at least
> the dense vector field, had been indexed fresh)?
>
> BTW I haven't yet moved from SOLR 9.0 to 9.1 but I do read that the HNSW
> codec has changed in some way so a reindex is required - I should probably
> try 9.1 (I would prioritise this if anyone says 9.1 is better quality or
> better performance for KNN searches!).
>
> Thanks for any info!
>
> Derek
>
> --
> Derek Conniffe
> Harvey Software Systems Ltd T/A HSSL
> Telephone (IRL): 086 856 3823
> Telephone (US): (650) 449 6044
> Skype: dconnrt
> Email: de...@hssl.ie
>
>
>
