Re: indexing pdf files using post tool

2016-03-19 Thread Francisco Andrés Fernández
Vidya, I'm not sure I understand you correctly, but I think the best way is
to parse your text with a routine outside Solr. You would need to map the
different parts of your document using your domain knowledge and have that
routine produce, for example, an XML document with a corresponding tag for
each part you need to distinguish. After that you can index it in Solr.
Francisco
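
A minimal sketch of the external routine described above, in Python. It
assumes the PDF text has already been extracted to a string and that each part
is introduced by a label such as "Name:" or "Title:". Those labels, the field
names, and the helper names split_fields and to_solr_xml are hypothetical and
would need to match your own documents and schema. The output uses Solr's XML
update format (add/doc/field).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical labels: adjust to however your PDFs mark each part.
LABELS = ("name", "id", "title")


def split_fields(text):
    """Split extracted text into labelled fields plus the remaining content."""
    fields = {}
    body = []
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s*:\s*(.+)", line)
        if match and match.group(1).lower() in LABELS:
            # A line such as "Title: My CV" becomes the field title="My CV".
            fields[match.group(1).lower()] = match.group(2).strip()
        elif line.strip():
            # Everything else is collected as free-text content.
            body.append(line.strip())
    fields["content"] = " ".join(body)
    return fields


def to_solr_xml(fields):
    """Build an <add><doc> update message with one <field> per entry."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


sample = "Name: Jane Doe\nID: 42\nTitle: My CV\nFirst line.\nSecond line."
print(to_solr_xml(split_fields(sample)))
```

The resulting XML could be saved to a file and sent to Solr with the same post
tool used for the PDF.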

On Wed, 16 Mar 2016 at 04:18, vidya <vidya.nade...@tcs.com> wrote:

> Sorry for putting it the wrong way. I want the data from one PDF file to be
> indexed into different fields of a single Solr document according to the
> data in it, such as name, id, title, content, etc.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: indexing pdf files using post tool

2016-03-19 Thread Binoy Dalal
Take a look at the CloneFieldUpdateProcessorFactory here:
http://www.solr-start.com/info/update-request-processors/

On Wed, 16 Mar 2016, 18:25 Binoy Dalal, <binoydala...@gmail.com> wrote:

> Like Francisco said, use a custom update processor to map the fields the
> way you want and add it to your update chain.
>
> On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, <fra...@gmail.com>
> wrote:
>
>> Vidya, I'm not sure I understand you correctly, but I think the best way
>> is to parse your text with a routine outside Solr. You would need to map
>> the different parts of your document using your domain knowledge and have
>> that routine produce, for example, an XML document with a corresponding
>> tag for each part you need to distinguish. After that you can index it in
>> Solr.
>> Francisco
>>
>> On Wed, 16 Mar 2016 at 04:18, vidya <vidya.nade...@tcs.com> wrote:
>>
>> > Sorry for putting it the wrong way. I want the data from one PDF file to
>> > be indexed into different fields of a single Solr document according to
>> > the data in it, such as name, id, title, content, etc.
>> >
>> > Thanks
>>
> --
> Regards,
> Binoy Dalal
>
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-18 Thread Binoy Dalal
Like Francisco said, use a custom update processor to map the fields the
way you want and add it to your update chain.

On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, <fra...@gmail.com>
wrote:

> Vidya, I'm not sure I understand you correctly, but I think the best way is
> to parse your text with a routine outside Solr. You would need to map the
> different parts of your document using your domain knowledge and have that
> routine produce, for example, an XML document with a corresponding tag for
> each part you need to distinguish. After that you can index it in Solr.
> Francisco
>
> On Wed, 16 Mar 2016 at 04:18, vidya <vidya.nade...@tcs.com> wrote:
>
> > Sorry for putting it the wrong way. I want the data from one PDF file to
> > be indexed into different fields of a single Solr document according to
> > the data in it, such as name, id, title, content, etc.
> >
> > Thanks
>
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-18 Thread Jan Høydahl
Hi

You can look at the Apache Tika project or the PDFBox project to parse your
files before sending them to Solr.
Alternatively, if your processing is very simple, you can use the built-in
Tika, as you just did, and then deploy some UpdateRequestProcessors to
transform the Tika output into whatever fields you like.
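
As a rough illustration related to the second option: Solr's extracting request
handler (the built-in Tika path) accepts fmap.<source>=<target> parameters to
rename Tika output fields and literal.<field>=<value> parameters to set fixed
values. The Python sketch below only builds such a request URL. The host, the
target field names (doc_title, owner) and the helper name extract_url are
assumptions for illustration; the core name and port are taken from the
original command. These parameters rename whole fields, so splitting the body
text into several fields would still need an update processor or external
parsing.

```python
from urllib.parse import urlencode


def extract_url(base, core, field_map, literals):
    """Build an /update/extract URL with fmap.* and literal.* parameters."""
    params = {}
    for source, target in field_map.items():
        params["fmap." + source] = target      # rename a Tika output field
    for name, value in literals.items():
        params["literal." + name] = value      # set a fixed field value
    params["commit"] = "true"
    return "{}/solr/{}/update/extract?{}".format(base, core, urlencode(params))


url = extract_url(
    "http://localhost:8984",     # assumed host; port as in the original command
    "core2",
    {"title": "doc_title"},      # hypothetical target field
    {"owner": "vidya"},          # hypothetical literal field
)
print(url)
```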

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 16 Mar 2016, at 08:18, vidya <vidya.nade...@tcs.com> wrote:
> 
> Sorry for putting it the wrong way. I want the data from one PDF file to be
> indexed into different fields of a single Solr document according to the
> data in it, such as name, id, title, content, etc.
> 
> Thanks 



Re: indexing pdf files using post tool

2016-03-16 Thread vidya
Sorry for putting it the wrong way. I want the data from one PDF file to be
indexed into different fields of a single Solr document according to the
data in it, such as name, id, title, content, etc.

Thanks 





Re: indexing pdf files using post tool

2016-03-15 Thread roshan agarwal
Yes Vidya, you just have to use a copy field.

Roshan

On Tue, Mar 15, 2016 at 3:07 PM, vidya <vidya.nade...@tcs.com> wrote:

> Hi
> I got the data into my content field, but I want different fields to be
> allocated for the data in my file. How can I achieve this?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>





Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
You should use copy fields.
https://cwiki.apache.org/confluence/display/solr/Copying+Fields
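
For reference, a copy field rule can be added through the Schema API as well as
by editing the schema file. The Python sketch below only builds the JSON body
for the Schema API's add-copy-field command, which would be POSTed to the
core's /schema endpoint. It assumes a managed schema; the destination field
name content_search and the helper name add_copy_field_command are
hypothetical. Note that a copy field duplicates the whole source value into
the destination and does not split it into parts.

```python
import json


def add_copy_field_command(source, dests):
    """Return the Schema API JSON body that adds one copyField rule."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dests}})


# Hypothetical destination field; it must already exist in the schema.
body = add_copy_field_command("content", ["content_search"])
print(body)
```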

On Tue, 15 Mar 2016, 15:07 vidya, <vidya.nade...@tcs.com> wrote:

> Hi
> I got the data into my content field, but I want different fields to be
> allocated for the data in my file. How can I achieve this?
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-15 Thread vidya
Hi
I got the data into my content field, but I want different fields to be
allocated for the data in my file. How can I achieve this?





Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
Do you have a "content" field defined in your schema? Is it stored?

By default, the content from the docs uploaded through post should be
mapped to a field called "content".

On Tue, 15 Mar 2016, 12:47 vidya, <vidya.nade...@tcs.com> wrote:

> Hi
> I am trying to index a PDF file using the post tool on my Linux system. When
> I give the command
> bin/post -c core2 -p 8984 /root/solr/My_CV.pdf
> the search results look like this:
> "response": {
> "numFound": 1,
> "start": 0,
> "docs": [
>   {
> "id": "/root/solr-5.5.0/My_CV.pdf",
> "meta_creation_date": [
>   "2016-03-15T06:22:17Z"
> ],
> "pdf_pdfversion": [
>   1.4
> ],
> "dcterms_created": [
>   "2016-03-15T06:22:17Z"
> ],
> "x_parsed_by": [
>   "org.apache.tika.parser.DefaultParser",
>   "org.apache.tika.parser.pdf.PDFParser"
> ],
> "xmptpg_npages": [
>   1
> ],
> "creation_date": [
>   "2016-03-15T06:22:17Z"
> ],
> "pdf_encrypted": [
>   false
> ],
> "title": [
>   "My CV"
> ],
> "stream_content_type": [
>   "application/pdf"
> ],
> "created": [
>   "Tue Mar 15 06:22:17 UTC 2016"
> ],
> "stream_size": [
>   18289
> ],
> "dc_format": [
>   "application/pdf; version=1.4"
> ],
> "producer": [
>   "wkhtmltopdf"
> ],
> "content_type": [
>   "application/pdf"
> ],
> "xmp_creatortool": [
>   "þÿ"
> ],
> "resourcename": [
>   "/root/solr/My_CV.pdf"
> ],
> "dc_title": [
>   "My CV"
> ],
> "_version_": 1528851429701189600
>   }
>
>
> but not the actual content of the PDF file.
> How can I index that data?
> Please help me with this.
> Can the post tool be used for indexing data from HDFS?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
-- 
Regards,
Binoy Dalal


indexing pdf files using post tool

2016-03-15 Thread vidya
Hi
I am trying to index a PDF file using the post tool on my Linux system. When I
give the command
bin/post -c core2 -p 8984 /root/solr/My_CV.pdf
the search results look like this:
"response": {
  "numFound": 1,
  "start": 0,
  "docs": [
    {
      "id": "/root/solr-5.5.0/My_CV.pdf",
      "meta_creation_date": [
        "2016-03-15T06:22:17Z"
      ],
      "pdf_pdfversion": [
        1.4
      ],
      "dcterms_created": [
        "2016-03-15T06:22:17Z"
      ],
      "x_parsed_by": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.pdf.PDFParser"
      ],
      "xmptpg_npages": [
        1
      ],
      "creation_date": [
        "2016-03-15T06:22:17Z"
      ],
      "pdf_encrypted": [
        false
      ],
      "title": [
        "My CV"
      ],
      "stream_content_type": [
        "application/pdf"
      ],
      "created": [
        "Tue Mar 15 06:22:17 UTC 2016"
      ],
      "stream_size": [
        18289
      ],
      "dc_format": [
        "application/pdf; version=1.4"
      ],
      "producer": [
        "wkhtmltopdf"
      ],
      "content_type": [
        "application/pdf"
      ],
      "xmp_creatortool": [
        "þÿ"
      ],
      "resourcename": [
        "/root/solr/My_CV.pdf"
      ],
      "dc_title": [
        "My CV"
      ],
      "_version_": 1528851429701189600
    }


but not the actual content of the PDF file.
How can I index that data?
Please help me with this.
Can the post tool be used for indexing data from HDFS?


