Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-24 Thread Shalin Shekhar Mangar

On Fri, Nov 20, 2009 at 9:13 PM, javaxmlsoapdev vika...@yahoo.com wrote:


> Did you extend DIH to do this work? Can you share code samples? I have a
> similar requirement: I need to index database records, and each record
> has a column with a document path, so I need to create another index for
> the documents (we allow users to search both indexes separately) in
> parallel with reading some metadata of the documents from the database.
> I have all sorts of different document formats to index. FYI, I am on
> Solr 1.4.0. Any pointers would be appreciated.


He did not extend DIH for this. He extracted the text from his documents,
saved it to files, and indexed those files with XPathEntityProcessor (on
Solr 1.4 you can use PlainTextEntityProcessor instead).

I don't know much about ExtractingRequestHandler, but if you want to use DIH,
you'll have to extend it to add Tika support. You may want to look at a
couple of open issues:

   1. https://issues.apache.org/jira/browse/SOLR-1358
   2. https://issues.apache.org/jira/browse/SOLR-1583

-- 
Regards,
Shalin Shekhar Mangar.

Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-23 Thread javaxmlsoapdev


Anyone any idea?

 
 
 

-- 
View this message in context: 
http://old.nabble.com/How-to-use-DataImportHandler-with-ExtractingRequestHandler--tp25267745p26485245.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-20 Thread javaxmlsoapdev

Did you extend DIH to do this work? Can you share code samples? I have a
similar requirement: I need to index database records, and each record
has a column with a document path, so I need to create another index for
the documents (we allow users to search both indexes separately) in
parallel with reading some metadata of the documents from the database.
I have all sorts of different document formats to index. FYI, I am on
Solr 1.4.0. Any pointers would be appreciated.

Thanks,


Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-03 Thread Sascha Szott


Hi Khai,

a few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), put it between


<foo><![CDATA[

and

]]></foo>

and store it in a text file. To keep the relationship between a file and
its corresponding database row, use the primary key as the file name
(with an .xml extension, to match the url attribute below).


Within data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow and primaryKey with the name of your parent entity and its primary
key column, respectively):


<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo"/>
</entity>


And, by the way, in Solr 1.4 you do not have to put your content between
XML tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
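
The Solr 1.4 variant could look roughly like the following data-config.xml.
This is a minimal, untested sketch and not a verified configuration. The data
source names (db, files), the JDBC driver, URL and credentials, the query, the
documents table, the basePath and the .txt extension are all placeholders
assumed for illustration. dbRow, primaryKey, pdfcontent and pdftext are reused
from the example above. As far as I know, PlainTextEntityProcessor puts the
whole file content into an implicit column named plainText, which is mapped to
the pdftext field here.

```xml
<dataConfig>
  <!-- Database rows; driver, url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.Driver"
              url="jdbc:example://localhost/mydb"
              user="solr" password="secret"/>

  <!-- Extracted text files, one per row; basePath is a placeholder -->
  <dataSource name="files" type="FileDataSource"
              basePath="/data/extracted" encoding="UTF-8"/>

  <document>
    <!-- Parent entity: one Solr document per database row (query is hypothetical) -->
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey, title FROM documents">

      <!-- Child entity: reads the plain text file named after the row's primary key -->
      <entity name="pdfcontent"
              processor="PlainTextEntityProcessor"
              dataSource="files"
              url="${dbRow.primaryKey}.txt">
        <!-- plainText is the implicit column holding the file content -->
        <field column="plainText" name="pdftext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With this layout the extraction step only has to write the raw text of each
document to a file named after its primary key; no wrapping in XML tags or
CDATA is needed.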


Best,
Sascha


How to use DataImportHandler with ExtractingRequestHandler?

2009-09-02 Thread Khai Doan

Hi all,

My name is Khai. I have a table in a relational database. I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file. How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan

Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-02 Thread Noble Paul നോബിള്‍ नोब्ळ्

Unfortunately, DIH is not yet integrated with ExtractingRequestHandler.
See https://issues.apache.org/jira/browse/SOLR-1358







-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com
