Did you extend DIH to do this work? Could you share code samples? I have a
similar requirement: I need to index database records, and each record has
a column holding a document path, so I also need to build a second index
for the documents (we let users search the two indexes separately) while
reading some document metadata from the database as well. The documents
come in all sorts of formats. FYI, I am on Solr 1.4.0. Any pointers would
be appreciated.

Thanks,

Sascha Szott wrote:
> 
> Hi Khai,
> 
> a few weeks ago, I was facing the same problem.
> 
> In my case, this workaround helped (assuming you're using Solr 1.3): 
> For each row, extract the content from the corresponding PDF file using 
> a parser library of your choice (I suggest Apache PDFBox, or Apache Tika 
> in case you need to process other file types as well), put it between
> 
>       <foo><![CDATA[
> 
> and
> 
>       ]]></foo>
> 
> and store it in a text file. To keep the relationship between a file and 
> its corresponding database row, use the primary key as the file name 
> (with an .xml extension, to match the url attribute below).
> 
> Within data-config.xml use the XPathEntityProcessor as follows (replace 
> dbRow and primaryKey respectively):
> 
> <entity name="pdfcontent"
>       processor="XPathEntityProcessor"
>       forEach="/foo"
>       url="${dbRow.primaryKey}.xml">
>    <field column="pdftext" xpath="/foo"/>
> </entity>
> 
> 
> And, by the way, in Solr 1.4 you do not have to put your content between 
> XML tags: use the PlainTextEntityProcessor instead of the
> XPathEntityProcessor.
> 
> Best,
> Sascha
> 
> Khai Doan schrieb:
>> Hi all,
>> 
>> My name is Khai.  I have a table in a relational database.  I have
>> successfully used DataImportHandler to import this data into Apache Solr.
>> However, one of the columns stores the location of a PDF file.  How can I
>> configure DataImportHandler to use ExtractingRequestHandler to extract
>> the content of the PDF?
>> 
>> Thanks!
>> 
>> Khai Doan
>> 
> 
> 
> 
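
For anyone on Solr 1.4, the PlainTextEntityProcessor variant that Sascha
mentions might look roughly like the data-config.xml below. This is an
untested sketch, not a confirmed setup: the JDBC driver, connection URL,
credentials, table and column names, the /data/extracted directory, the
.txt extension, and the Solr field names are all placeholders I made up
for illustration. It also assumes the text has already been extracted
outside of DIH (for example with Tika) into one plain-text file per row,
named after the primary key, as described in the quoted message.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; replace driver, url, user and password -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/docs"
              user="solr" password="secret"/>

  <!-- Directory holding the pre-extracted text files (assumed path) -->
  <dataSource name="files" type="FileDataSource"
              basePath="/data/extracted" encoding="UTF-8"/>

  <document>
    <!-- Outer entity: one Solr document per database row -->
    <entity name="dbRow" dataSource="db"
            query="SELECT id, title FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <!-- Inner entity: reads <id>.txt, resolved against basePath.
           PlainTextEntityProcessor exposes the whole file content
           in a single column called plainText. -->
      <entity name="doccontent" dataSource="files"
              processor="PlainTextEntityProcessor"
              url="${dbRow.id}.txt">
        <field column="plainText" name="doctext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that the variable in the url attribute has to match the column name
exactly as the JDBC driver returns it; some databases report column names
in upper case, in which case it would be ${dbRow.ID}.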

-- 
View this message in context: 
http://old.nabble.com/How-to-use-DataImportHandler-with-ExtractingRequestHandler--tp25267745p26443544.html
Sent from the Solr - User mailing list archive at Nabble.com.
