This thread got me thinking a bit... Does SOLR support the concept of "partial updates" to documents? By this I mean updating a subset of fields in a document that already exists in the index, without having to resubmit the entire document. An example would be storing/indexing user tags associated with documents. These tags will not be available when the document is initially presented to SOLR, and may or may not come along at a later time. When that time comes, can we just submit the tag data (and a document identifier, I'd imagine), or do we have to reimport the entire document? New to SOLR...
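(For what it's worth: as Erick notes below, Solr at the time of this thread overwrites the whole document on update, so there is no partial update. Later releases, from Solr 4.0 on, added "atomic updates" that work roughly the way the question describes. The sketch below, in Python, builds what such a request payload might look like; the `id`/`tags` field names and the use of the `add` modifier are illustrative assumptions, not something the Solr of this thread accepts.)

```python
import json

# Hypothetical atomic-update payload: only the document id and the
# changed field are sent. The "add" modifier asks Solr to append the
# values to the field rather than replace the whole document.
# Field names here are made up for illustration.
def tag_update_payload(doc_id, tags):
    return json.dumps([{"id": doc_id, "tags": {"add": tags}}])

payload = tag_update_payload("doc-42", ["ham", "spam"])
```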
> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
>
> How are you using it? Streaming the files to Solr via HTTP? You can use
> Tika on the client to extract the various bits from the structured
> documents, and use SolrJ to assemble the various bits of data Tika
> exposes into a Solr document that you then send to Solr. At the point
> you're transferring data from the Tika parse to the Solr document, you
> could add any data from your database that you wanted.
>
> The result is that you'd be indexing the complete Solr document only once.
>
> You're right that updating a document in Solr overwrites the previous
> version, and any data in the previous version is lost...
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges <greg.geor...@biztree.com> wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract
> > the data from my text files. In your second approach, you say I could
> > use a DIH to update the index which would have been created by the
> > extract handler in the first phase. I thought that, let's say I get
> > info from the DB and update the index with the document ID, will I
> > overwrite the data and lose the initial data from the extract handler
> > phase? Thanks
> >
> > Greg
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 June 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code?
> > Because if you are, the best thing to do is query your database at
> > that point and add whatever information you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database querying and modify the document as
> > necessary...
> > This is pretty simple to do; we can chat a bit more depending on
> > whether either approach makes sense.
> >
> > Best
> > Erick
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges <greg.geor...@biztree.com> wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create an index
> >> from multiple datasources. I have found references to SOLR-1358, but
> >> I don't think it fits my scenario. In all, we have an application
> >> where we upload files. On file upload, I use the Tika extract handler
> >> to save metadata from the file (_attr, literal values, etc.). We also
> >> have a database with information on the uploaded files, like the
> >> category, type, etc. I would like to update the index to include this
> >> information from the DB for each document. If I run a
> >> DataImportHandler after the extract phase, I am afraid that updating
> >> the doc in the index by its id will just overwrite the old
> >> information with the info from the DB (what I understand is that Solr
> >> updates its index by ID by deleting first, then recreating the info).
> >>
> >> Does anyone have any pointers? Is there a clean way to do this, or
> >> must I find a way to pass the db metadata to the extract handler and
> >> save it as literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
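The workflow Erick recommends above (parse with Tika on the client, enrich with database fields, then send one complete document to Solr) can be sketched as follows. The thread assumes SolrJ/Java; this is a Python sketch with stand-in functions, and `tika_extract`, `fetch_db_metadata`, and all field names are placeholders rather than real APIs.

```python
# Sketch of the "enrich before indexing" workflow: both the
# file-extracted fields and the DB fields end up in ONE document,
# so nothing is overwritten by a second update. All names here
# are placeholders for illustration.

def tika_extract(path):
    # stand-in for a real client-side Tika parse of the file
    return {"id": path, "content": "extracted text", "author": "unknown"}

def fetch_db_metadata(doc_id):
    # stand-in for a real database lookup keyed on the document id
    return {"category": "contracts", "type": "pdf"}

def build_solr_doc(path):
    doc = tika_extract(path)
    # merge DB fields into the same document BEFORE it is sent to Solr,
    # so the index is written only once with the complete data
    doc.update(fetch_db_metadata(doc["id"]))
    return doc

doc = build_solr_doc("files/report.pdf")
```

The key design point is ordering: because a Solr update replaces the whole document, the merge has to happen on the client side before indexing, not as a second update pass.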
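Greg's fallback idea, passing the DB metadata to the extract handler as literal fields, amounts to adding `literal.*` parameters to the extract request so the file content and the DB fields are indexed together in one call. A rough sketch of building such a request URL; the host, handler path, and field names are illustrative assumptions:

```python
from urllib.parse import urlencode

# Build the query string for an ExtractingRequestHandler call that
# carries DB metadata as literal.* fields alongside the uploaded file.
# Host, handler path, and field names are illustrative, not from the thread.
def extract_url(doc_id, db_fields):
    params = {"literal.id": doc_id}
    # each DB column becomes a literal.<field> request parameter
    params.update({"literal." + k: v for k, v in db_fields.items()})
    return "http://localhost:8983/solr/update/extract?" + urlencode(params)

url = extract_url("doc-42", {"category": "contracts", "type": "pdf"})
```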