Hello, I am not sure if this is the right place to ask, so feel free to redirect me if needed.
To my knowledge, Solr's DenseVectorField does not support multiple values per document, so I have been working on a solution that allows multiple (sentence-based) embeddings per document. I have tried several approaches and now have a working solution, though I am uncertain about its performance and practicality.

Currently, during indexing, I assign each document a set of dynamic fields named according to the convention "sentence{fieldIndex}_vector", where {fieldIndex} starts from zero. For each sentence in a document, I add a corresponding dynamic field that stores the embedding for that sentence. This gives me multiple embeddings per document (an example document is sketched at the end of this mail). My schema setup looks like this:

<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="128" similarityFunction="cosine"/>
<dynamicField name="*_vector" type="knn_vector" indexed="true" stored="true"/>

To process these fields, I implemented a custom query parser that, for each document, compares my query vector with each embedding in the document (each *_vector field). To achieve this, I iterate through the possible field names, checking whether a dynamic field with a given index is set (i.e. I check for "sentence0_vector", perform the comparison, then check "sentence1_vector", and so on). After this, I have a list of ValueSource objects that I can pass into a max function (see the sketch at the end).

I have two questions regarding this approach:

1. Is there a way to check whether a document has a value for a given dynamic field? Using Collection<SchemaField> allFields = this.req.getSchema().getFields().values(); only retrieves non-dynamic fields, and no error is thrown when accessing a dynamic field on a document that lacks a value for it. As a result, I have to specify a maximum number of fields statically. The code of my function is at the end.

2. Is this approach efficient, or could it interfere with Solr's caching and optimizations? I did not find much documentation beyond the source code, but I read about some sort of clustering (the HNSW graph, if I understand correctly) that allows faster searching of a big vector space. For a relatively small corpus (~5000 documents, each containing 3-8 sentences), query times range between 80 ms and 400 ms. Is this an indication that my approach is inefficient? Would it be better to index each sentence as a separate document instead?

Thank you in advance for any hints or tips!

Function code:

private List<ValueSource> collectDocumentVectors() throws SyntaxError {
    List<ValueSource> documentVectors = new ArrayList<>();
    int fieldIndex = 0;
    int maxIterations = 10; // static upper bound, see question 1
    while (fieldIndex < maxIterations) {
        // dynamically construct the field name, e.g. "sentence0_vector"
        String fieldName = "sentence" + fieldIndex + "_vector";
        try {
            // look the field up in the schema
            SchemaField field = this.req.getSchema().getField(fieldName);
            // ensure it is a dense vector field
            DenseVectorField vectorField = requireVectorType(field);
            // get the value source for this field and collect it
            ValueSource vectorSource = vectorField.getValueSource(field, this);
            documentVectors.add(vectorSource);
        } catch (Exception e) {
            // NEVER REACHED: getField() resolves any name matching the
            // *_vector dynamic field pattern, so it does not throw even when
            // no document has a value for that field
            if (fieldIndex == 0) {
                // sentence0_vector is required, so fail loudly if it is missing
                throw new SyntaxError("Required field 'sentence0_vector' is missing in the schema.");
            }
            // otherwise exit the loop at the first missing field
            break;
        }
        fieldIndex++;
    }
    // ensure at least one vector field was found
    if (documentVectors.isEmpty()) {
        throw new SyntaxError("No vector fields were found. At least the 'sentence0_vector' field is required.");
    }
    return documentVectors;
}
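
To make the indexing convention concrete, here is a sketch of one of my indexed documents (field values are made up and the vectors are shortened; the real ones have 128 dimensions):

{
  "id": "doc42",
  "sentence0_vector": [0.12, -0.03, ..., 0.07],
  "sentence1_vector": [0.54, 0.21, ..., -0.19],
  "sentence2_vector": [-0.33, 0.08, ..., 0.41]
}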
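
For reference, the combination step mentioned above can be sketched with Lucene's MaxFloatFunction (my actual code may differ in details), which takes the per-document maximum over its argument sources:

// combine the per-sentence sources into a single per-document score by
// taking the maximum; MaxFloatFunction lives in
// org.apache.lucene.queries.function.valuesource
List<ValueSource> documentVectors = collectDocumentVectors();
ValueSource maxSimilarity =
    new MaxFloatFunction(documentVectors.toArray(new ValueSource[0]));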
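
Regarding question 1, one idea I have not verified would be to enumerate the vector fields that actually occur in the index via Lucene's merged FieldInfos, instead of probing a fixed range of names, along these lines:

// sketch: list the sentence vector fields present anywhere in the index
IndexReader reader = req.getSearcher().getIndexReader();
for (FieldInfo fieldInfo : FieldInfos.getMergedFieldInfos(reader)) {
    if (fieldInfo.name.matches("sentence\\d+_vector")) {
        // this dynamic field has been indexed in at least one document
    }
}

This would only tell me which fields exist somewhere in the index, though, not whether a specific document has a value for them.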
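
Regarding the alternative at the end of question 2, my understanding is that indexing one sentence per document would look roughly like this (field names are just examples):

{ "id": "doc42_s0", "parent_id": "doc42", "sentence_vector": [0.12, -0.03, ...] }
{ "id": "doc42_s1", "parent_id": "doc42", "sentence_vector": [0.54, 0.21, ...] }

combined with the stock knn query parser and a collapse back to one result per original document:

q={!knn f=sentence_vector topK=10}[0.12, -0.03, ...]
fq={!collapse field=parent_id}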