Re: solr cell/tika: pdf import with xml metatags

Grant Ingersoll Tue, 27 Oct 2009 03:43:53 -0700

On Oct 27, 2009, at 6:36 AM, <markus.rietz...@rzf.fin-nrw.de> <markus.rietz...@rzf.fin-nrw.de> wrote:

hi,

we want to use SOLR as our intranet search engine.
i downloaded the nightly build of solr 1.4. pdf extraction is done via Solr Cell/Tika. i can send the pdf via curl
to solr.
we do have a large set of meta-tags for all our intranet documents, including PDF, PPT etc. to import html files from our CMS i have access to all of these meta tags and create an xml document which i send to SOLR,
eg.

<?xml version='1.0' encoding='UTF-8'?>
<add>
<doc>
<field name="id">1</field>
<field name="title">this is the title</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">this is another title</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title">this is the third title</field>
</doc>
</add>
this works fine with html files where i can grab all the meta tags, including "body".
so my question is, can i use this xml-document to send a pdf file also?

I'm not sure what you mean here, can you clarify? PDF and other "rich" documents can't be sent by XML.

ok, one way would be to use
the extracthandler with extract only and put the data in the "body"-field.
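
A minimal sketch of that two-step route, in Python: first ask the extract handler for the text only, then put that text into the "body" field of a normal XML add document. The base URL, the helper names and the sample field values are assumptions for illustration; `extractOnly=true` is the ExtractingRequestHandler parameter that returns the extracted content without indexing it. Fetching and parsing the extract-only response is left out here, since its exact shape depends on the Solr version.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def extract_only_url(base_url):
    """Build the URL for an extract-only request (nothing gets indexed)."""
    return base_url.rstrip("/") + "/update/extract?" + urlencode({"extractOnly": "true"})


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values.

    ElementTree escapes characters such as < and & in the field text.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical Solr location; adjust to the local installation.
    print(extract_only_url("http://localhost:8983/solr"))
    # "extracted pdf text" stands in for whatever the first request returned.
    print(build_add_xml({"id": "1", "title": "this is the title", "body": "extracted pdf text"}))
```

The resulting XML can then be posted to the regular update handler in the same way as the html documents above.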


I guess all I can point you at right now is the wiki:  
http://wiki.apache.org/solr/ExtractingRequestHandler
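
For reference, the handler described on that page accepts `literal.<fieldname>=<value>` request parameters, which attach your own field values to the document extracted from the PDF, so the CMS metadata can travel with the file in a single request. A rough sketch in Python, assuming a Solr instance at a hypothetical local URL, the handler mounted at `/update/extract`, and field names (`id`, `title`) that exist in your schema; the helper names are made up for illustration.

```python
import urllib.request
from urllib.parse import urlencode


def build_extract_url(base_url, metadata, commit=True):
    """Turn a dict of metadata into literal.<field> parameters on the extract URL."""
    params = [("literal." + name, value) for name, value in metadata.items()]
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def post_pdf(url, pdf_path):
    """POST the raw PDF bytes to the extract handler and return the HTTP status."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical base URL and metadata; nothing is sent unless post_pdf is called.
    url = build_extract_url(
        "http://localhost:8983/solr",
        {"id": "1", "title": "this is the title"},
    )
    print(url)
    # post_pdf(url, "document.pdf")  # uncomment with a real file and a running Solr
```

With this approach the PDF body is extracted by Tika on the Solr side, while the `literal.*` values play the role of the `<field>` elements in the XML example above.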

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
