RE: Indexing multiple pdf's and partial update of pdf

Jay Parashar Thu, 24 Mar 2016 07:57:48 -0700


Thanks Reth,




Yes, I am using Apache Tika and followed the instructions given in

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika



Here I see we can index a pdf "solr-word.pdf" to a document with unique key = "doc1" as below:



curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"



My requirement is to index a second, separate pdf into this same document with
key = doc1. Basically, I need the contents of both pdfs to be searchable and
related to id=doc1.



What comes to mind is to perform an 'extractOnly' as below on both pdfs and
then index the concatenation of the contents. Is there another, less invasive
way?



curl "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
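The extract-then-concatenate idea described above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the `content` field name, the use of `extractFormat=text` with `wt=json`, and the way the extracted body is picked out of the response are illustrative choices that were not confirmed in this thread, and every helper name is hypothetical.

```python
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/techproducts"  # core used in the curl examples


def extract_url(base=SOLR):
    """URL asking Solr Cell to return the extracted text without indexing it."""
    params = {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    return base + "/update/extract?" + urllib.parse.urlencode(params)


def pick_extracted_text(response):
    """Pull the extracted body out of an extractOnly JSON response.

    Assumption: the body is the string value whose key is neither
    'responseHeader' nor one ending in '_metadata'.
    """
    for key, value in response.items():
        if key == "responseHeader" or key.endswith("_metadata"):
            continue
        if isinstance(value, str):
            return value
    return ""


def combined_document(doc_id, texts, field="content"):
    """One Solr document holding the text of all files ('content' is assumed)."""
    return {"id": doc_id, field: "\n".join(t.strip() for t in texts)}


def extract_text(pdf_path):
    """POST the raw PDF bytes, like curl --data-binary, and return its text."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        extract_url(), data=data, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(req) as resp:
        return pick_extracted_text(json.load(resp))


def index_pdfs_as_one(doc_id, pdf_paths):
    """Extract every PDF, then index a single document with all the text."""
    texts = [extract_text(p) for p in pdf_paths]
    body = json.dumps([combined_document(doc_id, texts)]).encode("utf-8")
    req = urllib.request.Request(
        SOLR + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```

Calling `index_pdfs_as_one("doc1", [...])` against a running Solr would replace doc1 as a whole, so any other fields would have to be included in the same document.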



Thanks

Jay



-----Original Message-----
From: Reth RM [mailto:reth.ik...@gmail.com]
Sent: Thursday, March 24, 2016 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing multiple pdf's and partial update of pdf



Are you using the Apache Tika parser to parse the pdf files?



1) Solr supports parent-child block join, with which you can index the data of
more than one file within one document object (if that is what you are looking for):
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers



2) If the unique key of a document that already exists in the index is equal to
that of the new document you are indexing, the existing document will be
overwritten. If you'd like to do partial updates via curl, some examples are listed here:

http://yonik.com/solr/atomic-updates/
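The two options above can be sketched as JSON payloads for the regular `/update` handler. This is a minimal sketch, not a confirmed recipe: the field names `type_s`, `content_t` and `content`, the child id scheme, and all helper names are assumptions made for illustration. As far as I know, atomic updates also need the update log enabled and the other fields of the document to be stored, and they go through `/update` with text you have already extracted rather than through `/update/extract`.

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr/techproducts"  # core used earlier in the thread


def parent_with_files(doc_id, file_texts):
    """Option 1: one parent document with one child document per file."""
    children = [
        {"id": f"{doc_id}_file{i}", "type_s": "file", "content_t": text}
        for i, text in enumerate(file_texts, start=1)
    ]
    return {"id": doc_id, "type_s": "parent", "_childDocuments_": children}


def parent_query(term):
    """Block join query: return parents whose child files match the term."""
    return '{!parent which="type_s:parent"}content_t:' + term


def atomic_add(doc_id, field, value):
    """Option 2: atomic update adding a value to a multiValued field.

    Using "set" instead of "add" would replace the field's current value.
    """
    return [{"id": doc_id, field: {"add": value}}]


def post_json(payload, base=SOLR):
    """Send a JSON payload to /update and commit (needs a running Solr)."""
    req = urllib.request.Request(
        base + "/update?commit=true",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```

For example, `post_json(atomic_add("doc1", "content", second_pdf_text))` would append the text of a second pdf to doc1 without resending its other fields, while `post_json([parent_with_files("doc1", texts)])` would index the files as children of doc1.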











On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar <bparas...@slb.com> wrote:



> Hi,
>
> I have a couple of questions regarding indexing files (say pdf).
>
> 1) Is there any way to index more than one file to one document with
> a unique id?
>
> One way I can think of is to do an “extractOnly” of all the documents and then
> index that extract separately. Is there an easier way?
>
> 2) If my Solr document has existing fields populated and then I index
> a pdf, it seems it overwrites the document with the end result being
> just the contents of the pdf. I know we can do partial updates using
> SolrJ but is it possible to do partial updates of pdf using curl?
>
> Thanks
> Jay
