Re: How to use DataImportHandler with ExtractingRequestHandler?
On Fri, Nov 20, 2009 at 9:13 PM, javaxmlsoapdev vika...@yahoo.com wrote:

> Did you extend DIH to do this work? Can you share code samples? I have a similar requirement: I need to index database records, and each record has a column with a document path, so I need to create another index for the documents (we allow users to search both indexes separately), in parallel with reading some metadata of the documents from the database as well. I have all sorts of different document formats to index. FYI, I am on Solr 1.4.0. Any pointers would be appreciated.

He did not extend DIH for this. He extracted the text from his documents, saved it into files, and used XPathEntityProcessor (you can use PlainTextEntityProcessor) to index them.

I don't know much about ExtractingRequestHandler, but if you want to use DIH, you'll have to extend it to add Tika support. You may want to look at a couple of open issues:

1. https://issues.apache.org/jira/browse/SOLR-1358
2. https://issues.apache.org/jira/browse/SOLR-1583

--
Regards,
Shalin Shekhar Mangar.
Re: How to use DataImportHandler with ExtractingRequestHandler?
Does anyone have any idea?

javaxmlsoapdev wrote:

> Did you extend DIH to do this work? Can you share code samples? I have a similar requirement: I need to index database records, and each record has a column with a document path, so I need to create another index for the documents (we allow users to search both indexes separately), in parallel with reading some metadata of the documents from the database as well. I have all sorts of different document formats to index. I am on Solr 1.4.0. Any pointers would be appreciated.
>
> Thanks,

--
View this message in context: http://old.nabble.com/How-to-use-DataImportHandler-with-ExtractingRequestHandler--tp25267745p26485245.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to use DataImportHandler with ExtractingRequestHandler?
Did you extend DIH to do this work? Can you share code samples? I have a similar requirement: I need to index database records, and each record has a column with a document path, so I need to create another index for the documents (we allow users to search both indexes separately), in parallel with reading some metadata of the documents from the database as well. I have all sorts of different document formats to index. FYI, I am on Solr 1.4.0. Any pointers would be appreciated.

Thanks,

Sascha Szott wrote:

> Hi Khai,
>
> A few weeks ago, I was facing the same problem. In my case, this workaround helped (assuming you're using Solr 1.3): For each row, extract the content from the corresponding PDF file using a parser library of your choice (I suggest Apache PDFBox, or Apache Tika in case you need to process other file types as well), put it between <foo><![CDATA[ and ]]></foo>, and store it in a text file. To keep the relationship between a file and its corresponding database row, use the primary key as the file name.
>
> Within data-config.xml, use the XPathEntityProcessor as follows (replace dbRow and primaryKey accordingly):
>
>     <entity name="pdfcontent" processor="XPathEntityProcessor" forEach="/foo" url="${dbRow.primaryKey}.xml">
>       <field column="pdftext" xpath="/foo"/>
>     </entity>
>
> And, by the way, in Solr 1.4 you do not have to put your content between XML tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
>
> Best,
> Sascha
>
> Khai Doan wrote:
>
>> Hi all,
>>
>> My name is Khai. I have a table in a relational database. I have successfully used DataImportHandler to import this data into Apache Solr. However, one of the columns stores the location of a PDF file. How can I configure DataImportHandler to use ExtractingRequestHandler to extract the content of the PDF?
>>
>> Thanks!
>> Khai Doan
Re: How to use DataImportHandler with ExtractingRequestHandler?
Hi Khai,

A few weeks ago, I was facing the same problem. In my case, this workaround helped (assuming you're using Solr 1.3): For each row, extract the content from the corresponding PDF file using a parser library of your choice (I suggest Apache PDFBox, or Apache Tika in case you need to process other file types as well), put it between <foo><![CDATA[ and ]]></foo>, and store it in a text file. To keep the relationship between a file and its corresponding database row, use the primary key as the file name.

Within data-config.xml, use the XPathEntityProcessor as follows (replace dbRow and primaryKey accordingly):

    <entity name="pdfcontent" processor="XPathEntityProcessor" forEach="/foo" url="${dbRow.primaryKey}.xml">
      <field column="pdftext" xpath="/foo"/>
    </entity>

And, by the way, in Solr 1.4 you do not have to put your content between XML tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.

Best,
Sascha

Khai Doan wrote:

> Hi all,
>
> My name is Khai. I have a table in a relational database. I have successfully used DataImportHandler to import this data into Apache Solr. However, one of the columns stores the location of a PDF file. How can I configure DataImportHandler to use ExtractingRequestHandler to extract the content of the PDF?
>
> Thanks!
> Khai Doan
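The pre-extraction step of this workaround (one file per database row, named after the primary key, with the text wrapped in <foo><![CDATA[ ... ]]></foo>) could be scripted roughly as below. This is only a minimal sketch and not code from the thread: the function names, the out_dir layout, and the extract_text callable are assumptions made for illustration. extract_text stands in for whatever parser you use (PDFBox, Tika, ...), so no particular parser API is implied.

```python
from pathlib import Path
from typing import Callable


def wrap_in_cdata(text: str) -> str:
    """Wrap text in <foo><![CDATA[...]]></foo>.

    A literal ']]>' would end the CDATA section early, so it is split
    across two adjacent CDATA sections.
    """
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<foo><![CDATA[" + safe + "]]></foo>"


def write_row_file(
    out_dir: Path,
    primary_key: str,
    pdf_path: Path,
    extract_text: Callable[[Path], str],  # hypothetical: your PDFBox/Tika wrapper
) -> Path:
    """Extract the text of one document and store it as <primary_key>.xml."""
    text = extract_text(pdf_path)
    target = out_dir / f"{primary_key}.xml"
    target.write_text(wrap_in_cdata(text), encoding="utf-8")
    return target
```

You would call write_row_file once per row before running the import, so that url="${dbRow.primaryKey}.xml" resolves to an existing file. Note that text extracted from PDFs can contain control characters that are not legal in XML 1.0 even inside CDATA; this sketch does not filter them.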
How to use DataImportHandler with ExtractingRequestHandler?
Hi all,

My name is Khai. I have a table in a relational database. I have successfully used DataImportHandler to import this data into Apache Solr. However, one of the columns stores the location of a PDF file. How can I configure DataImportHandler to use ExtractingRequestHandler to extract the content of the PDF?

Thanks!
Khai Doan
Re: How to use DataImportHandler with ExtractingRequestHandler?
Unfortunately, DIH is not yet integrated with ExtractingRequestHandler. See https://issues.apache.org/jira/browse/SOLR-1358

On Thu, Sep 3, 2009 at 5:34 AM, Khai Doan khaitd...@gmail.com wrote:

> Hi all,
>
> My name is Khai. I have a table in a relational database. I have successfully used DataImportHandler to import this data into Apache Solr. However, one of the columns stores the location of a PDF file. How can I configure DataImportHandler to use ExtractingRequestHandler to extract the content of the PDF?
>
> Thanks!
> Khai Doan

--
Noble Paul | Principal Engineer | AOL | http://aol.com