Hello everyone,
I hope I'm posting this in the right place. I've been working extensively
with the TextToVector features of the Solr-LLM module. I use the update
processor to embed chunks of crawled documents in a search engine.
For performance reasons, I had to rework my process to use atomic updates
instead of embedding all documents at indexing time.
Here is the processor chain I use (in solrconfig.xml):
<updateRequestProcessorChain name="datafari-embed">
  <processor class="solr.llm.texttovector.update.processor.TextToVectorUpdateProcessorFactory">
    <str name="inputField">embedded_content</str>
    <str name="outputField">${texttovector.outputfield:vector_1536}</str>
    <str name="model">${texttovector.model:default_model}</str>
  </processor>
  <processor class="com.francelabs.datafari.updateprocessor.TextToVectorUpdateProcessorFactory">
    <str name="enabled">true</str>
    <str name="outputField">${texttovector.outputfield:vector_1536}</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Here is the workflow:
* A background job crawls the collection and sends an atomic update
request for each document.
* These requests target the /update/embed endpoint, which uses the
processor chain above.
* The request is processed: the value of embedded_content is embedded and
stored in the outputField (a KNN dense vector field).
Here is an example of an atomic update request, similar to those generated
by the job:
curl -X POST "http://localhost:8983/solr/VectorMain/update/embed" \
-H "Content-Type: application/json" \
-d '[
{
"id": "file://///localhost/mini/helloworld.txt_0",
"embedded_content": { "set": "Hello world" }
}
]'
I use the LangChain4j OpenAI integration to call my own LLM API to compute
the embeddings. However, the embedding model receives {set=Hello world}
instead of just "Hello world", which breaks the semantic vector generation.
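To illustrate what I think is happening: the string {set=Hello world} is
exactly what Java's Map.toString() produces for the atomic update value, so
my guess (not verified against the Solr source) is that the whole map is
stringified instead of its "set" value. Below is a minimal standalone Java
sketch of the unwrapping I would expect. The class AtomicValueUnwrap and the
method textToEmbed are hypothetical names for illustration only, not part of
Solr or Datafari.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicValueUnwrap {

    /**
     * Hypothetical helper (not a Solr API): returns the text that should be
     * sent to the embedding model for a given field value.
     * Assumption: an atomic "set" operation arrives as a Map with a "set" key.
     */
    public static String textToEmbed(Object fieldValue) {
        if (fieldValue instanceof Map) {
            Map<?, ?> op = (Map<?, ?>) fieldValue;
            if (op.containsKey("set")) {
                Object inner = op.get("set");
                // Unwrap the atomic update and keep only the actual value.
                return inner == null ? null : inner.toString();
            }
        }
        // Plain (non-atomic) value: use it as-is.
        return fieldValue == null ? null : fieldValue.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> atomic = new LinkedHashMap<>();
        atomic.put("set", "Hello world");

        // Stringifying the map reproduces what my model receives.
        System.out.println(atomic.toString());   // prints {set=Hello world}
        // Unwrapping first gives the text I expected to be embedded.
        System.out.println(textToEmbed(atomic)); // prints Hello world
    }
}
```

This only shows the value handling. Where such a step would belong in the
processor chain, if it is needed at all, is part of my question.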
For now, I am using Solr 9.8. I saw that the Solr 9.9 documentation mentions
partial updates for vector embeddings
(https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html).
Has this issue been fixed in 9.9? Is there a recommended workaround or
patch to ensure that only the string value is passed to the embedding model,
and not the atomic update syntax itself?
Thank you!
Kind regards,
Emeric Bernet-Rollande
France Labs Your knowledge, now
Datafari Enterprise Search