Re: indexing pdf files using post tool
Vidya, I'm not sure I fully understand the question, but I think the best approach is to parse the text with a routine outside Solr. You may need to use your domain knowledge to map the different parts of the document, and have that routine produce, for example, an XML document with a separate tag for each part you need to distinguish. After that you can index it in Solr.

Francisco

On Wed, 16 Mar 2016 at 04:18, vidya <vidya.nade...@tcs.com> wrote:
> Sorry for explaining it the wrong way. I want the data from one PDF file to be
> indexed into different fields of a Solr document according to the data in it,
> such as name, id, title, content, etc.
>
> Thanks
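A minimal sketch of the kind of external routine described above, in Python. It assumes the PDF text has already been extracted (for example with Tika or PDFBox) and that it contains labelled lines such as "Name: ..."; that layout, the FIELD_LABELS tuple, and the helper names split_fields and to_solr_xml are hypothetical and not from the thread. The output follows Solr's XML update format (add / doc / field elements).

```python
import xml.etree.ElementTree as ET

# Hypothetical labels; adjust to the layout of your own documents.
FIELD_LABELS = ("name", "id", "title")


def split_fields(text):
    """Split extracted text into labelled fields plus a free-text 'content' field.

    Assumes lines such as 'Name: Jane Doe'. The first line seen for each known
    label fills that field; any other non-empty line is appended to 'content'.
    """
    fields = {}
    content_lines = []
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in FIELD_LABELS and key not in fields:
            fields[key] = value.strip()
        elif line.strip():
            content_lines.append(line.strip())
    fields["content"] = "\n".join(content_lines)
    return fields


def to_solr_xml(fields):
    """Build a Solr XML update message with one <field> element per entry."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


sample = "Name: Jane Doe\nId: 42\nTitle: My CV\nTen years of search experience."
print(to_solr_xml(split_fields(sample)))
```

The resulting XML could be saved to a file and sent to Solr with the same post tool used elsewhere in this thread. The target fields have to exist in the schema, or the schema has to be able to create them.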
Re: indexing pdf files using post tool
Take a look at the CloneFieldUpdateProcessorFactory here: http://www.solr-start.com/info/update-request-processors/

On Wed, 16 Mar 2016, 18:25 Binoy Dalal, <binoydala...@gmail.com> wrote:
> Like Francisco said, use a custom update processor to map the fields the
> way you want, and add it to your update chain.

--
Regards,
Binoy Dalal
Re: indexing pdf files using post tool
Like Francisco said, use a custom update processor to map the fields the way you want, and add it to your update chain.

On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, <fra...@gmail.com> wrote:
> Vidya, I'm not sure I fully understand the question, but I think the best
> approach is to parse the text with a routine outside Solr. You may need to
> use your domain knowledge to map the different parts of the document, and
> have that routine produce, for example, an XML document with a separate tag
> for each part you need to distinguish. After that you can index it in Solr.
>
> Francisco

--
Regards,
Binoy Dalal
Re: indexing pdf files using post tool
Hi,

You can look at the Apache Tika project or the PDFBox project to parse your files before sending them to Solr. Alternatively, if your processing is very simple, you can use the built-in Tika as you just did, and then deploy some UpdateRequestProcessors to modify the Tika output into whatever fields you like.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 16 Mar 2016, at 08:18, vidya <vidya.nade...@tcs.com> wrote:
> Sorry for explaining it the wrong way. I want the data from one PDF file to be
> indexed into different fields of a Solr document according to the data in it,
> such as name, id, title, content, etc.
>
> Thanks
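For very simple cases, a lighter option than update processors is to shape the built-in Tika output with request parameters on Solr's extracting handler (/update/extract): literal.<field> adds a fixed value to the document and fmap.<source> renames a Tika-produced field. This is a different technique from the update-processor route suggested above, shown only as a sketch. The host name, the field names "name" and "body", and the helper build_extract_url are assumptions; the core name and port come from the original question.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, literals=None, field_map=None, commit=True):
    """Return a URL for Solr's extracting request handler.

    literals  -> literal.<field>=<value>: fixed values added to the document
    field_map -> fmap.<source>=<target>: rename a field produced by Tika
    """
    params = []
    for field, value in (literals or {}).items():
        params.append((f"literal.{field}", value))
    for source, target in (field_map or {}).items():
        params.append((f"fmap.{source}", target))
    if commit:
        params.append(("commit", "true"))
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


# Hypothetical values: supply id and name yourself, and store the extracted
# text in a field called "body" instead of "content".
url = build_extract_url(
    "http://localhost:8984",
    "core2",
    literals={"id": "My_CV", "name": "Jane Doe"},
    field_map={"content": "body"},
)
print(url)
```

The PDF itself would then be sent as the body of a POST request to that URL. Values such as name still have to come from somewhere outside Solr, so this only helps when they are known before indexing.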
Re: indexing pdf files using post tool
Sorry for explaining it the wrong way. I want the data from one PDF file to be indexed into different fields of a Solr document according to the data in it, such as name, id, title, content, etc.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing pdf files using post tool
Yes Vidya, you just have to use a copy field.

Roshan

On Tue, Mar 15, 2016 at 3:07 PM, vidya <vidya.nade...@tcs.com> wrote:
> Hi,
> I got the data into my content field, but I want different fields to be
> allocated for the data in my file. How can I achieve this?

--
Roshan Agarwal
Managing Director
Siddhast IP Innovation (P) Ltd
Re: indexing pdf files using post tool
You should use copy fields.
https://cwiki.apache.org/confluence/display/solr/Copying+Fields

On Tue, 15 Mar 2016, 15:07 vidya, <vidya.nade...@tcs.com> wrote:
> Hi,
> I got the data into my content field, but I want different fields to be
> allocated for the data in my file. How can I achieve this?

--
Regards,
Binoy Dalal
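A small sketch of how a copy field might be declared through Solr's Schema API rather than by editing the schema file by hand. It only builds the JSON body of an add-copy-field command; the destination field name "content_search" and the helper add_copy_field_command are hypothetical, and the approach assumes the core uses a managed schema. Note that a copy field copies the whole source value into the destination, so it fits the case where the same text is needed under another field name.

```python
import json


def add_copy_field_command(source, dests):
    """Return the JSON body for a Schema API 'add-copy-field' command."""
    return json.dumps({"add-copy-field": {"source": source, "dest": list(dests)}})


# Hypothetical destination field; it must already exist in the schema.
body = add_copy_field_command("content", ["content_search"])
print(body)
```

The body would be sent as a POST request with a JSON content type to the core's schema endpoint, for example /solr/core2/schema for the core named in this thread.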
Re: indexing pdf files using post tool
Hi,

I got the data into my content field, but I want different fields to be allocated for the data in my file. How can I achieve this?

--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing pdf files using post tool
Do you have a "content" field defined in your schema? Is it stored? By default, the content of documents uploaded through the post tool should be mapped to a field called "content".

On Tue, 15 Mar 2016, 12:47 vidya, <vidya.nade...@tcs.com> wrote:
> Hi,
> I am trying to index a PDF file using the post tool on my Linux system. When I
> run the command
> bin/post -c core2 -p 8984 /root/solr/My_CV.pdf
> the search results look like this:
> [...]
> but they do not include the actual content of the PDF file.

--
Regards,
Binoy Dalal
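One way to answer the two questions above is to ask the Schema API for the field definition (a GET on /solr/core2/schema/fields/content for the core in this thread) and inspect the JSON that comes back. The sketch below only interprets such a response; the sample response text and the helper describe_field are illustrative assumptions, not output from the poster's system.

```python
import json


def describe_field(response_text):
    """Summarise a Schema API response for a single field.

    Reports whether the field is defined and, if the 'stored' attribute is
    present on the field itself, what it is set to.
    """
    data = json.loads(response_text)
    field = data.get("field")
    if field is None:
        return "field is not defined in the schema"
    stored = field.get("stored")
    if stored is None:
        # 'stored' may be inherited from the field type when not set here.
        return f"{field['name']}: 'stored' not set on the field; check its field type"
    return f"{field['name']}: stored={'true' if stored else 'false'}"


# Illustrative response for a field that is indexed but not stored.
sample_response = (
    '{"responseHeader": {"status": 0}, '
    '"field": {"name": "content", "type": "text_general", "stored": false}}'
)
print(describe_field(sample_response))
```

A field that is indexed but not stored is searchable but never appears in the returned documents, which would match the symptom described in the original question.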
indexing pdf files using post tool
Hi,

I am trying to index a PDF file using the post tool on my Linux system. When I run the command

    bin/post -c core2 -p 8984 /root/solr/My_CV.pdf

the search results look like this:

    "response": {
      "numFound": 1,
      "start": 0,
      "docs": [
        {
          "id": "/root/solr-5.5.0/My_CV.pdf",
          "meta_creation_date": ["2016-03-15T06:22:17Z"],
          "pdf_pdfversion": [1.4],
          "dcterms_created": ["2016-03-15T06:22:17Z"],
          "x_parsed_by": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.pdf.PDFParser"
          ],
          "xmptpg_npages": [1],
          "creation_date": ["2016-03-15T06:22:17Z"],
          "pdf_encrypted": [false],
          "title": ["My CV"],
          "stream_content_type": ["application/pdf"],
          "created": ["Tue Mar 15 06:22:17 UTC 2016"],
          "stream_size": [18289],
          "dc_format": ["application/pdf; version=1.4"],
          "producer": ["wkhtmltopdf"],
          "content_type": ["application/pdf"],
          "xmp_creatortool": ["þÿ"],
          "resourcename": ["/root/solr/My_CV.pdf"],
          "dc_title": ["My CV"],
          "_version_": 1528851429701189600
        }

but they do not include the actual content of the PDF file. How can I index that data? Please help me with this.

Can the post tool be used to index data from HDFS?

--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811.html
Sent from the Solr - User mailing list archive at Nabble.com.