Re: Indexing multiple pdf's and partial update of pdf

2016-03-24 Thread Alexandre Rafalovitch
One approach that comes to mind is to use DataImportHandler, with the PDF
parsing done in the inner (nested) entities and the indexed document
defined at the parent entity level. The main issue is ensuring that the
Tika output from one PDF does not map to the same fields as the output
from the second one. Giving each PDF's fields a different prefix might do
it. You might then be able to merge them, either with an
UpdateRequestProcessor or with copyField rules from those two prefixes into
a common field, while ignoring the prefixed source fields.

Disclaimer: not tested.
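
For the copyField half of this idea, here is a minimal and equally untested
sketch using the Schema API from the shell. It assumes a managed schema, that
the DIH config already writes each PDF's text to the hypothetical fields
pdf1_text and pdf2_text, and that a catch-all destination field (called text
here) already exists. None of these names come from the thread.

```shell
#!/usr/bin/env bash
# Sketch only: register copyField rules through the Schema API.
# The URL and all field names are assumptions, not taken from the thread.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# Build the Schema API command that copies field $1 into field $2.
copy_field_cmd() {
  printf '{"add-copy-field":{"source":"%s","dest":"%s"}}' "$1" "$2"
}

# Send the command to the collection's /schema endpoint.
add_copy_field() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    "$SOLR_URL/schema" --data-binary "$(copy_field_cmd "$1" "$2")"
}

# Example (needs a running Solr with a managed schema):
#   add_copy_field pdf1_text text
#   add_copy_field pdf2_text text
```

If only the merged field should be kept, the prefixed source fields could
probably be declared with indexed=false and stored=false, which is one way
to read "ignoring source prefixes".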

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 25 March 2016 at 01:39, Jay Parashar <bparas...@slb.com> wrote:
>
> Thanks Reth,
>
> Yes, I am using Apache Tika and went by the instructions given in
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> Here I see we can index a PDF "solr-word.pdf" to a document with unique key
> = "doc1" as below:
>
> curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"
>
> My requirement is to index another, separate PDF to this document with key =
> doc1. Basically, I need the contents of both PDFs to be searchable and related
> to id=doc1.
>
> What comes to mind is to perform an 'extractOnly' as below on both PDFs
> and then index the concatenation of the contents. Is there another, less
> invasive way?
>
> curl "http://localhost:8983/solr/techproducts/update/extract?extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
>
> Thanks
> Jay
>
> -Original Message-
> From: Reth RM [mailto:reth.ik...@gmail.com]
> Sent: Thursday, March 24, 2016 12:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing multiple pdf's and partial update of pdf
>
> Are you using the Apache Tika parser to parse PDF files?
>
> 1) Solr supports parent-child block join, with which you can index the data
> of more than one file within a single document object (if that is what you
> are looking for):
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>
> 2) If the unique key of a document that exists in the index equals that of
> the new document you are indexing, the existing document will be
> overwritten. If you'd like to do partial updates via curl, here are some
> examples:
> http://yonik.com/solr/atomic-updates/
>
> On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar <bparas...@slb.com> wrote:
>
>> Hi,
>>
>> I have a couple of questions regarding indexing files (say PDFs).
>>
>> 1)  Is there any way to index more than one file to one document with
>> a unique id?
>>
>> One way I can think of is to do an “extractOnly” of all the documents and
>> then index that extract separately. Is there an easier way?
>>
>> 2)  If my Solr document has existing fields populated and then I index
>> a PDF, it seems to overwrite the document, with the end result being
>> just the contents of the PDF. I know we can do partial updates using
>> SolrJ, but is it possible to do partial updates of a PDF using curl?
>>
>> Thanks
>> Jay


RE: Indexing multiple pdf's and partial update of pdf

2016-03-24 Thread Jay Parashar


Thanks Reth,

Yes, I am using Apache Tika and went by the instructions given in
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Here I see we can index a PDF "solr-word.pdf" to a document with unique key =
"doc1" as below:

curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"

My requirement is to index another, separate PDF to this document with key =
doc1. Basically, I need the contents of both PDFs to be searchable and related
to id=doc1.

What comes to mind is to perform an 'extractOnly' as below on both PDFs and
then index the concatenation of the contents. Is there another, less invasive
way?

curl "http://localhost:8983/solr/techproducts/update/extract?extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
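
The extract-then-concatenate idea could be scripted roughly as follows. This
is only a sketch: it assumes jq is installed, that the target field is a
hypothetical content_txt, and that the extractOnly JSON response holds the
extracted text under the first key that is neither responseHeader nor a
*_metadata entry (the key name seems to depend on how the file is uploaded).
Posting the combined document replaces whatever doc1 contained before.

```shell
#!/usr/bin/env bash
# Sketch only: extract text from several files with extractOnly, then index
# the concatenation as one document. URL and field name are assumptions.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# URL for a text-only extraction that returns JSON instead of indexing.
extract_url() {
  printf '%s/update/extract?extractOnly=true&extractFormat=text&wt=json' "$SOLR_URL"
}

# Print the extracted text of one file ($1) on stdout.
extract_text() {
  curl -sS "$(extract_url)" -F "myfile=@$1" |
    jq -r 'to_entries
           | map(select(.key != "responseHeader"
                        and (.key | endswith("_metadata") | not)))
           | .[0].value'
}

# Read all text on stdin and wrap it in a one-document JSON update for id $1.
build_doc() {
  jq -Rs --arg id "$1" '[{id: $id, content_txt: .}]'
}

# Extract every file named after the id, concatenate, and index under that id.
index_combined() {
  local id="$1" f
  shift
  for f in "$@"; do extract_text "$f"; done |
    build_doc "$id" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      "$SOLR_URL/update?commit=true" --data-binary @-
}

# Example (needs a running Solr and jq):
#   index_combined doc1 first.pdf second.pdf
```

Streaming the text through stdin rather than passing it as an argument avoids
command-line length limits with large PDFs.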



Thanks
Jay

-Original Message-
From: Reth RM [mailto:reth.ik...@gmail.com]
Sent: Thursday, March 24, 2016 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing multiple pdf's and partial update of pdf



Are you using the Apache Tika parser to parse PDF files?

1) Solr supports parent-child block join, with which you can index the data
of more than one file within a single document object (if that is what you
are looking for):
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

2) If the unique key of a document that exists in the index equals that of
the new document you are indexing, the existing document will be
overwritten. If you'd like to do partial updates via curl, here are some
examples:
http://yonik.com/solr/atomic-updates/

On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar <bparas...@slb.com> wrote:

> Hi,
>
> I have a couple of questions regarding indexing files (say PDFs).
>
> 1)  Is there any way to index more than one file to one document with
> a unique id?
>
> One way I can think of is to do an “extractOnly” of all the documents and
> then index that extract separately. Is there an easier way?
>
> 2)  If my Solr document has existing fields populated and then I index
> a PDF, it seems to overwrite the document, with the end result being
> just the contents of the PDF. I know we can do partial updates using
> SolrJ, but is it possible to do partial updates of a PDF using curl?
>
> Thanks
> Jay


Re: Indexing multiple pdf's and partial update of pdf

2016-03-23 Thread Reth RM
Are you using the Apache Tika parser to parse PDF files?

1) Solr supports parent-child block join, with which you can index the data
of more than one file within a single document object (if that is what you
are looking for):
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

2) If the unique key of a document that exists in the index equals that of
the new document you are indexing, the existing document will be
overwritten. If you'd like to do partial updates via curl, here are some
examples:
http://yonik.com/solr/atomic-updates/
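
As a rough illustration of the atomic-update route with curl, something like
the following could work. It assumes the update log is enabled and that the
document's other fields are stored, since atomic updates rebuild the document
from its stored values. The field name notes_t and the value are made up for
the example. As far as I know this goes through the regular /update handler
rather than /update/extract, so PDF text would have to be extracted
separately first.

```shell
#!/usr/bin/env bash
# Sketch only: atomic ("partial") update of a single field with curl.
# The URL, field name, and value are assumptions, not taken from the thread.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# Build the JSON body for an atomic update.
# $1 = document id, $2 = field name, $3 = modifier (set or add), $4 = value.
# Values are assumed to need no JSON escaping.
atomic_update_body() {
  printf '[{"id":"%s","%s":{"%s":"%s"}}]' "$1" "$2" "$3" "$4"
}

# Send the update; only the named field changes, other stored fields are kept.
send_atomic_update() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    "$SOLR_URL/update?commit=true" \
    --data-binary "$(atomic_update_body "$@")"
}

# Example (needs a running Solr):
#   send_atomic_update doc1 notes_t set "reviewed"
```

The "add" modifier appends a value to a multi-valued field, which may suit
attaching the text of a second PDF to an existing document.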





On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar  wrote:

> Hi,
>
> I have a couple of questions regarding indexing files (say PDFs).
>
> 1)  Is there any way to index more than one file to one document with
> a unique id?
>
> One way I can think of is to do an “extractOnly” of all the documents and
> then index that extract separately. Is there an easier way?
>
> 2)  If my Solr document has existing fields populated and then I index
> a PDF, it seems to overwrite the document, with the end result being just
> the contents of the PDF. I know we can do partial updates using SolrJ, but
> is it possible to do partial updates of a PDF using curl?
>
>
> Thanks
> Jay
>