Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-24 Thread Shalin Shekhar Mangar
On Fri, Nov 20, 2009 at 9:13 PM, javaxmlsoapdev vika...@yahoo.com wrote:


 Did you extend DIH to do this work? Can you share code samples? I have a
 similar requirement: I need to index database records, and each record
 has a column with a document path, so I need to create another index for
 the documents (we allow users to search both indexes separately) in parallel
 with reading some metadata of the documents from the database. I have all
 sorts of different document formats to index. FYI, I am on Solr 1.4.0. Any
 pointers would be appreciated.


He did not extend DIH for this. He extracted the text from his documents,
saved it into files, and used XPathEntityProcessor (you can use
PlainTextEntityProcessor) to index them.

I don't know much about ExtractingRequestHandler, but if you want to use DIH,
you'll have to extend it to add Tika support. You may want to look at a
couple of open issues:

   1. https://issues.apache.org/jira/browse/SOLR-1358
   2. https://issues.apache.org/jira/browse/SOLR-1583

-- 
Regards,
Shalin Shekhar Mangar.


Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-23 Thread javaxmlsoapdev

Does anyone have any idea?

javaxmlsoapdev wrote:
 
 Did you extend DIH to do this work? Can you share code samples? I have a
 similar requirement: I need to index database records, and each record
 has a column with a document path, so I need to create another index for
 the documents (we allow users to search both indexes separately) in parallel
 with reading some metadata of the documents from the database. I have all
 sorts of different document formats to index. I am on Solr 1.4.0. Any
 pointers would be appreciated.
 
 Thanks,
 
 
 




Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-20 Thread javaxmlsoapdev

Did you extend DIH to do this work? Can you share code samples? I have a
similar requirement: I need to index database records, and each record
has a column with a document path, so I need to create another index for
the documents (we allow users to search both indexes separately) in parallel
with reading some metadata of the documents from the database. I have all
sorts of different document formats to index. FYI, I am on Solr 1.4.0. Any
pointers would be appreciated.

Thanks,

Sascha Szott wrote:
 
 Hi Khai,
 
 a few weeks ago, I was facing the same problem.
 
 In my case, this workaround helped (assuming you're using Solr 1.3):
 For each row, extract the content from the corresponding pdf file using 
 a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
 in case you need to process other file types as well), put it between
 
   <foo><![CDATA[
 
 and
 
   ]]></foo>
 
 and store it in a text file. To keep the relationship between a file and 
 its corresponding database row, use the primary key as the file name.
 
 Within data-config.xml use the XPathEntityProcessor as follows (replace 
 dbRow and primaryKey respectively):
 
 <entity name="pdfcontent"
         processor="XPathEntityProcessor"
         forEach="/foo"
         url="${dbRow.primaryKey}.xml">
   <field column="pdftext" xpath="/foo"/>
 </entity>
 
 
 And, by the way, in Solr 1.4 you do not have to put your content between 
 xml tags: use the PlainTextEntityProcessor instead of
 XPathEntityProcessor.
 
 Best,
 Sascha
 
 Khai Doan wrote:
 Hi all,
 
 My name is Khai.  I have a table in a relational database.  I have
 successfully used DataImportHandler to import this data into Apache Solr.
 However, one of the columns stores the location of a PDF file.  How can I
 configure DataImportHandler to use ExtractingRequestHandler to extract
 the content of the PDF?
 
 Thanks!
 
 Khai Doan
 
 
 
 




Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-03 Thread Sascha Szott

Hi Khai,

a few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding pdf file using 
a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
in case you need to process other file types as well), put it between


<foo><![CDATA[

and

]]></foo>

and store it in a text file. To keep the relationship between a file and 
its corresponding database row, use the primary key as the file name.
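
As an illustration, assuming a row whose primary key is 42 (a made-up
value), the resulting file 42.xml might look like the sketch below. The text
inside the CDATA section is only a placeholder for whatever the parser
library extracted:

```xml
<!-- 42.xml: hypothetical file for the database row with primary key 42 -->
<foo><![CDATA[
Text extracted from the PDF goes here.
]]></foo>
```

One caveat: if the extracted text itself contains the sequence ]]>, it would
close the CDATA section early, so that sequence has to be split across two
CDATA sections (or otherwise escaped) before the file is written.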


Within data-config.xml use the XPathEntityProcessor as follows (replace 
dbRow and primaryKey respectively):


<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo"/>
</entity>


And, by the way, in Solr 1.4 you do not have to put your content between 
xml tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
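
A minimal sketch of what such a Solr 1.4 data-config.xml could look like,
assuming the extracted text is stored as plain .txt files named after the
primary key. The data source names, the JDBC driver and URL, the table and
column names, and the basePath are all placeholders rather than values from
a real setup. PlainTextEntityProcessor puts the whole file content into an
implicit column called plainText, which is mapped to the pdftext field here:

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; replace with your own database -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.Driver"
              url="jdbc:example://localhost/mydb"
              user="solr" password="secret"/>
  <!-- Directory holding the extracted text files (hypothetical path) -->
  <dataSource name="files" type="FileDataSource"
              basePath="/data/extracted" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: one Solr document per database row -->
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey, title FROM documents">
      <!-- Inner entity: reads the file named after the primary key -->
      <entity name="pdfcontent" dataSource="files"
              processor="PlainTextEntityProcessor"
              url="${dbRow.primaryKey}.txt">
        <field column="plainText" name="pdftext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The pdftext field, like any field filled from the database columns, also has
to be declared in schema.xml. Depending on the database, the column name in
${dbRow.primaryKey} may need to match the case the JDBC driver returns.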


Best,
Sascha

Khai Doan wrote:

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan





How to use DataImportHandler with ExtractingRequestHandler?

2009-09-02 Thread Khai Doan
Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan


Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
Unfortunately, DIH is not yet integrated with ExtractingRequestHandler.
See https://issues.apache.org/jira/browse/SOLR-1358



On Thu, Sep 3, 2009 at 5:34 AM, Khai Doan khaitd...@gmail.com wrote:
 Hi all,

 My name is Khai.  I have a table in a relational database.  I have
 successfully used DataImportHandler to import this data into Apache Solr.
 However, one of the columns stores the location of a PDF file.  How can I
 configure DataImportHandler to use ExtractingRequestHandler to extract the
 content of the PDF?

 Thanks!

 Khai Doan




-- 
Noble Paul | Principal Engineer| AOL | http://aol.com