Re: Indexing data from multiple datasources

2011-06-09 Thread Tom Gross

It's been a feature request for ages ...

https://issues.apache.org/jira/browse/SOLR-139

--
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C

Tom Gross
email: @toms-projekte.de
skype: tom_gross
web: http://toms-projekte.de
blog: http://blog.toms-projekte.de



RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
No, from what I understand the way Solr does an update is to delete the
document and then recreate all the fields; there is no partial updating of the
document. Maybe because of performance issues or locking?
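
The usual workaround for this is read-then-rewrite: fetch the stored fields of
the existing document, add the new values, and re-index the whole document. A
minimal SolrJ sketch, assuming every field you need to keep is stored="true"
(the URL, id, and field names below are placeholders, not anything from the
thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexWithNewField {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Fetch the current version of the document by its unique key
            SolrDocument old = server.query(new SolrQuery("id:doc1"))
                                     .getResults().get(0);

            // Copy every stored field into a fresh input document; fields
            // that are indexed but not stored cannot be recovered this way.
            SolrInputDocument doc = new SolrInputDocument();
            for (String name : old.getFieldNames()) {
                doc.addField(name, old.getFieldValues(name));
            }

            // Add the new data and re-index: Solr replaces the old version
            // (delete by unique key, then add) rather than patching it.
            doc.addField("tags", "user-tag-1");
            server.add(doc);
            server.commit();
        }
    }

Note that any field that is indexed but not stored is lost on the rewrite, so
this pattern only fits schemas where everything of value is stored.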



RE: Indexing data from multiple datasources

2011-06-09 Thread David Ross

This thread got me thinking a bit...

Does SOLR support the concept of "partial updates" to documents? By this I
mean updating a subset of fields in a document that already exists in the
index, without having to resubmit the entire document.

An example would be storing/indexing user tags associated with documents.
These tags will not be available when the document is initially presented to
SOLR, and may or may not come along at a later time. When that time comes, can
we just submit the tag data (and the document identifier, I'd imagine), or do
we have to import the entire document?

New to SOLR...


Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
How are you using it? Streaming the files to Solr via HTTP? You can use Tika
on the client to extract the various bits from the structured documents, and
use SolrJ to assemble the data Tika exposes into a Solr document that you then
send to Solr. At the point where you transfer data from the Tika parse to the
Solr document, you can add any data from your database that you want.

The result is that you'd be indexing the complete Solr document only once.

You're right that updating a document in Solr overwrites the previous version,
and any data in the previous version is lost.

Best
Erick
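
A rough sketch of the client-side flow Erick describes, assuming Tika and
SolrJ are on the classpath (the file name, field names, and the database
lookup are placeholders to replace with your own):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class IndexFileWithDbData {
        public static void main(String[] args) throws Exception {
            // 1. Run Tika on the client to extract body text and metadata
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            InputStream in = new FileInputStream(new File("some.pdf"));
            try {
                parser.parse(in, handler, metadata, new ParseContext());
            } finally {
                in.close();
            }

            // 2. Assemble one Solr document from the Tika output ...
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc1");
            doc.addField("text", handler.toString());
            doc.addField("title", metadata.get("title"));

            // 3. ... plus whatever your database knows about the file
            //    (lookupCategory is a stand-in for your own JDBC query)
            doc.addField("category", lookupCategory("doc1"));

            // 4. Send the complete document to Solr in one shot
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.add(doc);
            server.commit();
        }

        private static String lookupCategory(String id) {
            return "invoice"; // placeholder for a real DB lookup
        }
    }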



RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello Erick,

Thanks for the response. No, I am using the extract handler to extract the
data from my text files. In your second approach, you say I could use a DIH to
update the index that was created by the extract handler in the first phase.
But let's say I get info from the DB and update the index by document ID:
won't I overwrite the data and lose the initial data from the extract handler
phase? Thanks

Greg



Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point and add
whatever information you need to the document.

If you're using DIH to do the crawl, consider implementing a Transformer to do
the database querying and modify the document as necessary. This is pretty
simple to do; we can chat a bit more depending on whether either approach
makes sense.

Best
Erick
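
A skeletal Transformer along the lines Erick suggests; the class name, field
names, and lookup methods are placeholders, and a real version would run a
JDBC query instead of returning constants:

    import java.util.Map;

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    // Referenced from data-config.xml, e.g.:
    //   <entity name="files" transformer="com.example.DbEnrichTransformer" ...>
    public class DbEnrichTransformer extends Transformer {

        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            // "id" must match the column/field holding your unique key
            Object id = row.get("id");

            // Placeholder: look the file up in your own database here
            // and merge the result into the row before it is indexed.
            row.put("category", lookupCategoryFromDb(id));
            row.put("type", lookupTypeFromDb(id));

            return row; // the modified row is what gets indexed
        }

        private String lookupCategoryFromDb(Object id) {
            return "invoice"; // stand-in for a real JDBC query
        }

        private String lookupTypeFromDb(Object id) {
            return "pdf"; // stand-in for a real JDBC query
        }
    }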





Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello all,

I have checked the forums to see if it is possible to create an index from
multiple datasources. I have found references to SOLR-1358, but I don't think
it fits my scenario. In short, we have an application where we upload files.
On file upload, I use the Tika extract handler to save metadata from the file
(_attr fields, literal values, etc.). We also have a database with information
about the uploaded files, such as the category, type, etc. I would like the
index to include this DB information for each document. If I run a
DataImportHandler after the extract phase, I am afraid that updating the doc
in the index by its id will just overwrite the old information with the info
from the DB (what I understand is that Solr updates its index by ID by
deleting the document first and then recreating it).

Does anyone have any pointers? Is there a clean way to do this, or must I find
a way to pass the DB metadata to the extract handler and save it as literal
fields?

Thanks in advance

Greg
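
On the literal-fields question at the end: the extract handler does accept
arbitrary literal.<field> parameters, so if the DB row can be fetched at
upload time, the Tika metadata and the DB values can be indexed in one pass. A
sketch with SolrJ (the URL, id, and field values are placeholders):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractWithLiterals {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("some.pdf"));

            // Tika metadata goes in as usual; each literal.* parameter
            // becomes a plain field on the same document.
            req.setParam("literal.id", "doc1");
            req.setParam("literal.category", "invoice"); // value from your DB
            req.setParam("literal.type", "pdf");         // value from your DB
            req.setParam("uprefix", "attr_");

            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            server.request(req);
        }
    }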