On Oct 27, 2009, at 6:36 AM, <markus.rietz...@rzf.fin-nrw.de> wrote:
hi,
we want to use SOLR as our intranet search engine.
i downloaded the nightly build of solr 1.4. pdf extraction is done via
Solr Cell/Tika. i can send the pdf via curl
to solr.
we have a large set of meta tags for all our intranet documents,
including PDF, PPT etc. to import html
files from our CMS, i have access to all of these meta tags and create
an xml document which i send to SOLR,
e.g.
<?xml version='1.0' encoding='UTF-8'?>
<add>
<doc>
<field name="id">1</field>
<field name="title">this is the title</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">this is another title</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title">this is the third title</field>
</doc>
</add>
this works fine with html files where i can grab all the meta tags,
including "body".
so my question is, can i use this xml-document to send a pdf file
also?
I'm not sure what you mean here, can you clarify? PDF and other
"rich" documents can't be sent by XML.
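One way to keep the CMS meta tags together with a PDF, sketched here in Python, is to post the raw file to the extract handler and pass each meta tag as a literal.<fieldname> request parameter. The base URL, the example field names and the helper names build_extract_url and post_pdf are assumptions for illustration, not something taken from this thread; the fields have to exist in your schema.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of a local Solr instance; adjust to your installation.
SOLR_URL = "http://localhost:8983/solr"


def build_extract_url(doc_id, metadata, solr_url=SOLR_URL, commit=True):
    """Build an /update/extract URL that carries meta tags as literal.* params."""
    params = [("literal.id", doc_id)]
    # Each meta tag becomes literal.<fieldname>=<value>; sorted for a stable URL.
    params += [("literal." + name, value) for name, value in sorted(metadata.items())]
    if commit:
        params.append(("commit", "true"))
    return solr_url + "/update/extract?" + urlencode(params)


def post_pdf(pdf_path, doc_id, metadata):
    """POST the raw PDF bytes to Solr Cell; needs a running Solr to succeed."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(doc_id, metadata),
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_pdf("report.pdf", "4", {"title": "this is a pdf title"}) would then send the file and its title in one request; the file name here is only a placeholder.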
ok, one way would be to use
the ExtractingRequestHandler with extractOnly=true and put the extracted
text in the "body" field.
I guess all I can point you at right now is the wiki:
http://wiki.apache.org/solr/ExtractingRequestHandler
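The extract-only route mentioned above could look roughly like the following Python sketch: first ask the extract handler for the text without indexing it, then place that text in the "body" field of the usual <add> message. The base URL and the helper names are assumptions for illustration. Reading the extracted text out of the handler's response is left out on purpose, because the response layout depends on the Solr version and the response writer in use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_extract_only_url(solr_url="http://localhost:8983/solr"):
    """URL for step 1: extract the text but do not index the document."""
    # The default base URL is an assumption; adjust to your installation.
    return solr_url + "/update/extract?" + urlencode({"extractOnly": "true"})


def build_add_xml(doc_id, title, body_text):
    """Step 2: build the <add> message with the extracted text as 'body'."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("body", body_text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on serialization
    return ET.tostring(add, encoding="unicode")
```

The resulting string can be posted to the normal update handler in the same way as the XML for the html files. The trade-off compared with literal.* parameters is two requests per document instead of one, in exchange for full control over what ends up in each field.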
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/