Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-02-05 Thread saisantoshi
I am looking at the formats supported by the newer version of Tika (1.3) and am not sure which version(s) of Microsoft Office (97/2000/2010/2013) it supports for each of the formats below: http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats Microsoft Word (also does it support

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-28 Thread VIGNESH S
Apache Tika: you can use it to extract text from PDF and Word documents. Internally it uses Apache POI to extract text from Office documents and PDFBox to extract text from PDF documents. On Sat, Jan 26, 2013 at 4:24 AM, saisantoshi saisantosh...@gmail.com wrote: I want to
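
As a rough illustration of the point above, the sketch below uses Tika's `org.apache.tika.Tika` facade, which detects the document type and delegates to the matching parser on its own. It assumes the Tika core and parsers jars are on the classpath; the class name `TikaFacadeExample` and the demo input are made up for illustration and are not from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class for illustration; not part of Tika or Lucene.
final class TikaFacadeExample {

    // Returns the plain text that Tika extracts from the stream.
    // Tika detects the format and chooses the parser itself.
    static String extractText(InputStream in) throws IOException, TikaException {
        Tika tika = new Tika();
        // parseToString stops at 100,000 characters by default; -1 disables the limit.
        tika.setMaxStringLength(-1);
        return tika.parseToString(in);
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, extract that file; otherwise use a small in-memory sample.
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8))) {
            System.out.println(extractText(in));
        }
    }
}
```

The returned string is what one would typically hand to a Lucene text field for indexing; the facade is the shortest route when only the text is needed.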

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand
Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler,

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead. You could take a look at the Solr Extracting Request Handler source for an example:
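
For readers who want to see what "using Tika directly" might look like, here is a minimal sketch (not the Solr source) built on `AutoDetectParser`, which picks the concrete parser from the stream contents. It assumes the Tika core and parsers jars are on the classpath; the class name `AutoDetectExample` and the two-element return value are illustrative choices only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class for illustration; not taken from Solr or Tika.
final class AutoDetectExample {

    // Returns { detected content type, extracted body text }.
    static String[] parse(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 removes the default write limit on the collected text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        return new String[] { metadata.get("Content-Type"), handler.toString() };
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new ByteArrayInputStream(
                "hello tika".getBytes(StandardCharsets.UTF_8))) {
            String[] result = parse(in);
            System.out.println("type: " + result[0]);
            System.out.println("text: " + result[1]);
        }
    }
}
```

Compared with calling a format-specific parser, this trades a little detection overhead for not having to know the file type up front, and the `Metadata` object also exposes fields such as the content type that may be worth indexing alongside the text.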

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread saisantoshi
We are not using Solr, just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents of the documents and indexing them with Lucene. My question here is: is Tika

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
Re-read my last message - and then take a look at that Solr source code, which will give you an idea how to use Tika, even though you are using Lucene only. If you have specific questions, please be specific. To answer your latest question, yes, Tika is good enough. Solr /update/extract uses

Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-25 Thread saisantoshi
I want to index document content (such as PDF/Word/Excel) and am wondering if there are any good readers that I can integrate into Lucene 4.0. Any pointers or example code would be appreciated. The Lucene in Action book mentions Tika as the library to use, but I am not sure if this is the preferred