This thread got me thinking a bit... Does SOLR support the concept of "partial updates" to documents? By this I mean updating a subset of fields in a document that already exists in the index, without having to resubmit the entire document. An example would be storing/indexing user tags associated with documents. These tags will not be available when the document is initially presented to SOLR, and may or may not come along at a later time. When that time comes, can we just submit the tag data (and a document identifier, I'd imagine), or do we have to reimport the entire document? New to SOLR...
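(For what it's worth: as Erick notes below, Solr at the time of this thread overwrites the whole document on update, so there is no partial update. Later releases, from Solr 4.0 on, added "atomic updates" that work roughly the way the question describes. The sketch below, in Python, builds what such a request payload might look like; the `id`/`tags` field names and the use of the `add` modifier are illustrative assumptions, not something the Solr of this thread accepts.)

```python
import json

# Hypothetical atomic-update payload: only the document id and the
# changed field are sent. The "add" modifier asks Solr to append the
# values to the field rather than replace the whole document.
# Field names here are made up for illustration.
def tag_update_payload(doc_id, tags):
    return json.dumps([{"id": doc_id, "tags": {"add": tags}}])

payload = tag_update_payload("doc-42", ["ham", "spam"])
```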
> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
>
> How are you using it? Streaming the files to Solr via HTTP? You can use
> Tika on the client to extract the various bits from the structured
> documents, and use SolrJ to assemble the various bits of data Tika
> exposes into a Solr document that you then send to Solr. At the point
> you're transferring data from the Tika parse to the Solr document, you
> could add any data from your database that you wanted.
>
> The result is that you'd be indexing the complete Solr document only once.
>
> You're right that updating a document in Solr overwrites the previous
> version, and any data in the previous version is lost...
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges <greg.geor...@biztree.com> wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract
> > the data from my text files. In your second approach, you say I could
> > use a DIH to update the index which would have been created by the
> > extract handler in the first phase. I thought that, let's say I get
> > info from the DB and update the index with the document ID, will I
> > overwrite the data and lose the initial data from the extract handler
> > phase? Thanks
> >
> > Greg
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 June 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code?
> > Because if you are, the best thing to do is query your database at
> > that point and add whatever information you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database querying and modify the document as
> > necessary...
> > This is pretty simple to do; we can chat a bit more depending on
> > whether either approach makes sense.
> >
> > Best
> > Erick
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges <greg.geor...@biztree.com> wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create an index
> >> from multiple datasources. I have found references to SOLR-1358, but
> >> I don't think it fits my scenario. In all, we have an application
> >> where we upload files. On file upload, I use the Tika extract handler
> >> to save metadata from the file (_attr, literal values, etc.). We also
> >> have a database with information on the uploaded files, like the
> >> category, type, etc. I would like to update the index to include this
> >> information from the DB for each document. If I run a
> >> DataImportHandler after the extract phase, I am afraid that updating
> >> the doc in the index by its id will just overwrite the old
> >> information with the info from the DB (what I understand is that Solr
> >> updates its index by ID by deleting first, then recreating the info).
> >>
> >> Does anyone have any pointers? Is there a clean way to do this, or
> >> must I find a way to pass the db metadata to the extract handler and
> >> save it as literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
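The workflow Erick recommends above (parse with Tika on the client, enrich with database fields, then send one complete document to Solr) can be sketched as follows. The thread assumes SolrJ/Java; this is a Python sketch with stand-in functions, and `tika_extract`, `fetch_db_metadata`, and all field names are placeholders rather than real APIs.

```python
# Sketch of the "enrich before indexing" workflow: both the
# file-extracted fields and the DB fields end up in ONE document,
# so nothing is overwritten by a second update. All names here
# are placeholders for illustration.

def tika_extract(path):
    # stand-in for a real client-side Tika parse of the file
    return {"id": path, "content": "extracted text", "author": "unknown"}

def fetch_db_metadata(doc_id):
    # stand-in for a real database lookup keyed on the document id
    return {"category": "contracts", "type": "pdf"}

def build_solr_doc(path):
    doc = tika_extract(path)
    # merge DB fields into the same document BEFORE it is sent to Solr,
    # so the index is written only once with the complete data
    doc.update(fetch_db_metadata(doc["id"]))
    return doc

doc = build_solr_doc("files/report.pdf")
```

The key design point is ordering: because a Solr update replaces the whole document, the merge has to happen on the client side before indexing, not as a second update pass.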
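Greg's fallback idea, passing the DB metadata to the extract handler as literal fields, amounts to adding `literal.*` parameters to the extract request so the file content and the DB fields are indexed together in one call. A rough sketch of building such a request URL; the host, handler path, and field names are illustrative assumptions:

```python
from urllib.parse import urlencode

# Build the query string for an ExtractingRequestHandler call that
# carries DB metadata as literal.* fields alongside the uploaded file.
# Host, handler path, and field names are illustrative, not from the thread.
def extract_url(doc_id, db_fields):
    params = {"literal.id": doc_id}
    # each DB column becomes a literal.<field> request parameter
    params.update({"literal." + k: v for k, v in db_fields.items()})
    return "http://localhost:8983/solr/update/extract?" + urlencode(params)

url = extract_url("doc-42", {"category": "contracts", "type": "pdf"})
```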