Hello,

I am not sure if this is the right place to ask, so feel free to redirect
me if needed.

To my knowledge, Solr does not support multiple DenseVector fields per
document, so I have been working on a solution to allow multiple
(sentence-based) embeddings per document.

I have tried several approaches and now have a working solution, though I
am uncertain about its performance and practicality.

Currently, during indexing, I assign each document a set of dynamic fields
named according to the convention "sentence{fieldIndex}_vector", where
{fieldIndex} starts from zero. For each sentence in a document, I add a
corresponding dynamic field that stores the embedding for that sentence.
This approach lets me store multiple embeddings per document. My schema
setup looks like this:

<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="128" similarityFunction="cosine"/>
<dynamicField name="*_vector" type="knn_vector"
              indexed="true" stored="true"/>
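
For illustration, indexing then looks roughly like this (a minimal SolrJ
sketch, not my actual code; the core URL and the embed() helper are
placeholders for my setup and embedding model):

import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SentenceVectorIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                 new Http2SolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            String[] sentences = {"First sentence.", "Second sentence.", "Third sentence."};
            for (int i = 0; i < sentences.length; i++) {
                // one dynamic field per sentence: sentence0_vector, sentence1_vector, ...
                doc.addField("sentence" + i + "_vector", embed(sentences[i]));
            }

            client.add(doc);
            client.commit();
        }
    }

    // placeholder for the embedding model; returns a 128-dim vector
    private static List<Float> embed(String sentence) {
        return Collections.nCopies(128, 0.0f); // dummy value
    }
}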

To process these fields, I implemented a custom query parser that loops
through each document and calculates the (cosine) similarity between my
query vector and each embedding in the document (each *_vector field). To
achieve this, I iterate through all fields in a document, checking whether
a specific dynamic field with an index is set (e.g. I check for
"sentence0_vector", perform the comparison, then check "sentence1_vector",
and so on). After this, I have a list of ValueSource objects that I can
pass into a max function.
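
The reduction step then looks roughly like this (a sketch; I use Lucene's
MaxFloatFunction, and collectDocumentVectors() is the function quoted at
the end of this mail):

import java.util.List;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.valuesource.MaxFloatFunction;

// Reduce the per-sentence ValueSources to a single score per document
// by taking the maximum over all sentence fields.
private ValueSource maxOverSentences(List<ValueSource> documentVectors) {
    return new MaxFloatFunction(documentVectors.toArray(new ValueSource[0]));
}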

I have two questions regarding this approach:

Is there a way to check whether a document has a value for a given dynamic
field? Using Collection<SchemaField> allFields =
this.req.getSchema().getFields().values(); only retrieves non-dynamic
fields, and no error is thrown when accessing a dynamic field on a document
that lacks a value for it. As a result, I have to specify a maximum number
of fields statically. The code of my function is at the end of this mail.
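
One idea I have considered (but not tested) is to bypass the schema and
enumerate the field names that actually exist in the index via Lucene's
FieldInfos. Note this is per-index rather than per-document, so it would
still not tell me whether a specific document has a value:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;

// Untested idea: list the vector fields present anywhere in the index
// instead of probing the schema name by name.
private List<String> findSentenceVectorFields() {
    IndexReader reader = this.req.getSearcher().getIndexReader();
    List<String> names = new ArrayList<>();
    for (FieldInfo fieldInfo : FieldInfos.getMergedFieldInfos(reader)) {
        if (fieldInfo.name.startsWith("sentence") && fieldInfo.name.endsWith("_vector")) {
            names.add(fieldInfo.name);
        }
    }
    return names;
}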

My second question is whether this approach is efficient, or whether it
could interfere with Solr's caching and optimizations. I did not find much
documentation beyond the source code, but I read that Lucene builds some
sort of graph structure (HNSW, if I understood correctly) to allow faster
searching of a big vector space. For a relatively small corpus (~5000
documents, each containing 3-8 sentences), query times range between 80 ms
and 400 ms. Is this an indication that my approach is inefficient? Would it
be better to index each sentence as a separate document instead, as
sketched below?
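
To make that alternative concrete, here is roughly what the
sentence-per-document variant would look like (a sketch assuming Solr 9+
and a hypothetical single sentence_vector field; a parent id field would be
needed to map hits back to their documents):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch (Solr 9+): one sentence per document, a single "sentence_vector"
// field, and the built-in knn query parser instead of a custom one.
// vectorLiteral is the query embedding, e.g. "[0.12, 0.31, ...]" (128 floats).
QueryResponse searchSentences(SolrClient client, String vectorLiteral) throws Exception {
    SolrQuery query = new SolrQuery("{!knn f=sentence_vector topK=10}" + vectorLiteral);
    return client.query(query);
}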

Thank you in advance for any hints or tips!

Function code:
private List<ValueSource> collectDocumentVectors() throws SyntaxError {
    List<ValueSource> documentVectors = new ArrayList<>();
    int fieldIndex = 0;
    int maxIterations = 10;

    while (fieldIndex < maxIterations) {
        // dynamically construct the field name
        String fieldName = "sentence" + fieldIndex + "_vector";

        try {
            // get the field from the schema
            SchemaField field = this.req.getSchema().getField(fieldName);

            // cast it to a dense vector field
            DenseVectorField vectorField = requireVectorType(field);

            // get the value source for this field and add it to the list
            ValueSource vectorSource = vectorField.getValueSource(field, this);
            documentVectors.add(vectorSource);
        } catch (Exception e) {
            // NEVER REACHED: the "*_vector" dynamic field matches every
            // constructed name, so getField() never throws here

            if (fieldIndex == 0) {
                // the sentence0_vector field is required
                throw new SyntaxError("Required field 'sentence0_vector' is missing in the schema.");
            }
            // exit the loop as soon as a field is not found
            break;
        }
        fieldIndex++;
    }

    // ensure at least one vector field was found
    if (documentVectors.isEmpty()) {
        throw new SyntaxError("No vector fields were found. At least the 'sentence0_vector' field is required.");
    }

    return documentVectors;
}
