Trey Grainger created SOLR-14397:
------------------------------------

             Summary: Vector Search in Solr
                 Key: SOLR-14397
                 URL: https://issues.apache.org/jira/browse/SOLR-14397
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Trey Grainger


Search engines have traditionally relied upon token-based matching (typically 
keywords) on an inverted index, plus relevance ranking based upon keyword 
occurrence statistics. This can be viewed as a "sparse vector” match (where 
each term is a one-hot encoded dimension in the vector), since only a few 
keywords out of all possible keywords are considered in each query. With the 
introduction of deep-learning-based transformers over the last few years, 
however, the state of the art in relevance has moved to ranking models based 
upon dense vectors that encode a latent, semantic understanding of both 
language constructs and the underlying domain upon which the model was trained. 
These dense vectors are also referred to as “embeddings”. An example of this 
kind of embedding would be taking the phrase “chief executive officer of the 
tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]
. Other similar phrases should encode to vectors with very similar numbers, so 
we may expect a query like “CEO of a technology org” to generate a vector like 
[0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation 
between these vectors, we would expect a number closer to 1.0, whereas a very 
unrelated text blurb would generate a much smaller cosine similarity.

This is a proposal for how we should implement these vector search capabilities 
in Solr.
h1. Search Process Overview:

In order to implement dense vector search, the following process is typically 
followed:
h2. Offline:

An encoder is built. An encoder can take in text (a query, a sentence, a 
paragraph, a document, etc.) and return a dense vector representing that 
document in a rich semantic space. The semantic space is learned from training 
on textual data (usually, though other sources work, too), typically from the 
domain of the search engine.
h2. Document Ingestion:

When documents are processed, they are passed to the encoder, and the dense 
vector(s) returned are stored as fields on the document. There could be one or 
more vectors per-document, as the granularity of the vectors could be 
per-document, per field, per paragraph, per-sentence, or even per phrase or per 
term.
h2. Query Time:

*Encoding:* The query is translated to a dense vector by passing it to the 
encoder
 Quantization: The query is quantized. Quantization is the process of taking a 
vector with many values and turning it into “terms” in a vector space that 
approximates the full vector space of the dense vectors.
 *ANN Matching:* A query on the quantized vector tokens is executed as an ANN 
(approximate nearest neighbor) search. This allows finding most of the best 
matching documents (typically up to 95%) with a traditional and efficient 
lookup against the inverted index.
 (optional) ANN Ranking: ranking may be performed based upon the matched 
quantized tokens to get a rough, initial ranking of documents based upon the 
similarity of the query and document vectors. This allows the next step 
(re-ranking) to be performed on a smaller subset of documents. 
 *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is 
performed, a similarity calculation (cosine, dot-product, or any number of 
other calculations) is typically performed between the full (non-quantized) 
dense vectors for the query and those in the document. This re-ranking will 
typically be on the top-N results for performance reasons.
 *Return Results:* As with any search, the final step is typically to return 
the results in relevance-ranked order. In this case, that would be sorted by 
the re-ranking similarity score (i.e. “cosine descending”).
 ------

*Variant:* For small document sets, it may be preferable to rank all documents 
and skip steps steps 2, 3, and 4. This is because ANN Matching typically 
reduces recall (current state of the art is around 95% recall), so it can be 
beneficial to rank all documents if performance is not a concern. In this case, 
step 5 is performed on the full doc set and would obviously just be considered 
“ranking” instead of “re-”ranking.
h1. Proposed Implementation in Solr:
h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
 * 
h3. Dense Vector Field:

We will add a new dense vector field type in Solr. This field type would be a 
compressed encoding of a dense vector into a BinaryDocValues Field. There are 
other ways to do it, but this is almost certain to be the most efficient.
 Ideally this field is multi-valued. If it is single-valued then we are either 
limited to only document-level vectors, or otherwise we have to create many 
vector fields (i.e. per paragraph or term) and search across them, which will 
never be practical or scale well. BinaryDocValues does not natively support 
multiple values, but since the Solr field type will need to be responsible for 
encoding the numeric vectors as a byte[] it should also be able to encode 
multiple vectors at the same time.
 * 
h3. Vector Similarity Value Sources:

Implement function queries (value sources) that take in a query vector, a 
(multi-valued) vector field name, and a multi-value selector ("max", "min" 
,"avg", "first"), and return the computed calculation. This function could then 
be used to score documents via the “func” query parser. ANN queries could be 
implemented using separate “fq” params for filtering, or as part of the “q” 
query parameter to support optional ANN Ranking, and then the existing Solr 
re-ranking functionality could be used for re-ranking the top N results using 
the “func” query parser and these new cosine or dot product value sources as 
the re-rank query. Existing Solr logic for performing two-phase iteration will 
ensure that these vector functions will not be excessively computed for 
documents that do not already match the “cheaper” ANN queries.

*Example*:
{noformat}
q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first){noformat}
^ simple, no-ANN. Third param is optional and is the function 
(min|max|avg|first|last). I think this should default to "max" (or whatever 
"most related" is for each similarity function. It's "max" for cosine and dot 
product).

*Example:*
{noformat}
q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, 
first)&fq=[ANN_QUERY]{noformat}
^ ensures function is not computed unless ANN query matches due to two-phase 
iterator

*Example:*
{noformat}
q=[ANN_QUERY]&rq={!rerank reRankQuery=$rqq reRankDocs=10000 
reRankWeight=1000000}&rqq={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", 
vectors_field){noformat}
^Supports an initial ranking based upon the ANN QUERY, followed by a re-ranking 
based upon the cosine function.

The key implementation challenges are expected to be:
 * ensuring decent query (speed) performance
 * Determining an efficient vector/byte encoding that has decent 
disk-space/query time decoding trade offs and makes it possible to decode only 
the ‘first’ vector per doc in cases where that is all that’s requested by the 
distance function

h2. Phase 2: Pluggable ANN Search
 # Implement pluggable quantization/ANN search. There are multiple different 
algorithms to do quantization for efficient ANN search. The simplest is 
something like LSH, with more advanced implementations being worked on in 
Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a 
prototyped LSH implementation (Solr Vector Search plugin) and a random 
projections approach (Hangry) linked to in SOLR-12890.


 # Given that the transformers/encoders are a rapidly evolving area of data 
science and NLP research, it feels unreasonable to assume that Solr will be 
able to keep up to date on the latest model integrations. Given that, if we 
build support into Solr for encoder models, we need to make it very pluggable.


 # Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus for 
external systems to accomplish quantization by creating quantized terms 
externally, indexing them to Solr, and then sending in queries to Solr with the 
quantized terms. I’m not convinced that keeping quantization outside of 
Lucene/Solr isn’t a more manageable model long term, but if certain models do 
get integrated natively into Lucene (as is currently being worked on) then we 
should obviously look to leverage them.

h2. Important Considerations:
 * Multi-valued dense vectors are a key differentiating feature here. I’m 
considering this a key requirement and designing accordingly. This will allow 
for multiple embeddings per document (i.e. “word embeddings, sentence 
embeddings, paragraph embeddings, etc.) to be modeled. I like the way 
[~ehatcher] designed the the Payload Function Query and think we should model 
the multi-valued handling similarly.


 * As best as possible, it would be nice to support multiple (pluggable) 
fields/value sources that can be converted to dense vectors and compared. At a 
minimum we need an efficient dense vector field, but it may be beneficial to 
support a sparse vector field where only meaningful values need to be encoded, 
in order to save space. Additionally, there may be some utility in other field 
types or arbitrary value sources that can product vector-like output to be able 
to be leveraged in the function queries being proposed.

h1. Note on Prior Approaches
 * There are a handful of ways to perform vector search today with Solr. One is 
through Streaming Expressions, and alternatively there are a few plugins which 
have implemented this functionality for Solr. One of these was contributed in 
SOLR-12890, and I've outlined examples of most of these approaches in the 
comments on SOLR-12890.


 * The Streaming Expressions approach works well out of the box, but doesn't 
integrate nicely with traditional keyword search use cases, so we need a 
solution that integrates with real-time queries and enables the full suite of 
Solr's query-time features.
 * The most advanced Solr plugin out there seems to be this one (forked the 
version in SOLR-12890 and then added LSH functionality for ANN): 
[https://github.com/moshebla/solr-vector-scoring

]
 * I encountered several challenges when working with the implementation in 
SOLR-12890 which informed the above proposed design:

1. The plugin encodes dense vectors into payloads attached to terms 
representing vector positions. This is pretty inefficient and is way too slow 
at significant scale. I believe we instead need the underlying field types to 
be binary doc values field with efficient compression and decoding for optimal 
performance when scanning through all the matching documents and dimensions.

2. The plugin only supports one vector per-field, making supporting word/phrase 
vectors, sentence vectors, paragraph vectors, etc. challenging. It's not 
obvious how to modify the current approach to support multiple vectors per 
field.

3. The plugin uses a query parser instead of just a function query, which 
prevents re-use of the vector similarity function. Using a function query 
instead will enable the vector similarity scores to be easily leveraged for 
relevance scoring, sorting, returned as fields, or used within other function 
queries. Given that there are two goals of 1) filtering on ANN, and 2) Scoring 
on vector similarity (usually via re-ranking), using a function query for the 
scoring piece will be more flexible across use scoring use cases (combining 
with other functions, returning, sorting, etc.)

4.  The LSH quantization/ANN implementation is a cool idea, but it is 
hard-coded in the implementation. I recommend making it a separate filter and 
making the implementation more pluggable (though for now this can be 
externalized and passed in to Solr). We may be able to pull the LSH work in 
during phase 2.

For cleaner discussion and tracking, I'm moving this revised proposal to this 
new JIRA, and we can use SOLR-12890 for continue discussion (if any) of the 
previously contributed plugin implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to