I am looking at the formats supported by the newer version of Tika (1.3) and
am not sure which version(s) of Microsoft Office (97/2000/2010/2013) it
supports for each of the formats below:
http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats
Microsoft Word (also, does it support
Apache Tika: you can use it to extract text from PDF and Word documents.
It internally uses Apache POI to extract text from Office documents.
It uses PDFBox to extract text from PDF documents.
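
For illustration, here is a minimal sketch of that kind of extraction. It
assumes the Tika parsers jar (which pulls in POI and PDFBox) is on the
classpath and Java 7 or later; the class name TikaExtractSketch and the helper
extractText are made up for this example. AutoDetectParser picks a parser
based on the detected type, so the same call should cover PDF and Office
files, but treat this as a starting point rather than a tested recipe.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, used only for this sketch.
public class TikaExtractSketch {

    /** Returns the plain text of a document, letting Tika detect its format. */
    public static String extractText(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the handler's default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a PDF, Word, or Excel file (assumption for the demo).
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(extractText(in));
        }
    }
}
```

If you already know the type, you could swap AutoDetectParser for a specific
parser class; the parse call has the same shape.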
On Sat, Jan 26, 2013 at 4:24 AM, saisantoshi saisantosh...@gmail.com wrote:
I want to
Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?
This question seems more appropriate for the Tika user mailing list [3].
[1]
http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream,
org.xml.sax.ContentHandler,
You may be able to use Tika directly without needing to choose the specific
classes, although the latter may give you the specific data you need without
the extra overhead.
You could take a look at the Solr Extracting Request Handler source for an
example:
We are not using Solr; we use only the Lucene core 4.0 engine. I am trying to
see whether we can use the Tika library to extract textual information from
PDF/Word/Excel documents. I am mainly interested in reading the contents
of the documents and indexing them with Lucene. My question here is: is Tika
Re-read my last message - and then take a look at that Solr source code,
which will give you an idea how to use Tika, even though you are using
Lucene only. If you have specific questions, please be specific.
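
As a rough sketch of what that looks like with plain Lucene (no Solr): extract
the text with Tika, then add it to a Lucene document. This assumes Lucene 4.0
(core plus analyzers-common) and Tika 1.3 on the classpath; the class name
TikaLuceneSketch, the helper indexFile, and the field names "path" and
"contents" are invented for the example and are not taken from the Solr
source.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

// Hypothetical class name, used only for this sketch.
public class TikaLuceneSketch {

    /** Extracts the text of one file with Tika and adds it to the index. */
    public static void indexFile(IndexWriter writer, File file) throws Exception {
        // The Tika facade detects the type and picks a parser automatically.
        // Note: parseToString truncates long documents unless you raise the
        // limit with setMaxStringLength.
        String text = new Tika().parseToString(file);

        Document doc = new Document();
        // Stored, not tokenized: lets you find the file again from a hit.
        doc.add(new StringField("path", file.getPath(), Field.Store.YES));
        // Tokenized for full-text search, not stored.
        doc.add(new TextField("contents", text, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: index directory, args[1]: document to index (demo assumptions).
        Directory dir = FSDirectory.open(new File(args[0]));
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, cfg);
        try {
            indexFile(writer, new File(args[1]));
        } finally {
            writer.close();
        }
    }
}
```

The extraction step is independent of the indexing step, so you can replace
the Tika facade with a specific parser later without touching the Lucene code.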
To answer your latest question, yes, Tika is good enough. Solr
/update/extract uses
I want to index document content (such as PDF/Word/Excel) and am wondering
whether there are any good readers that I can integrate with Lucene 4.0.
Any pointers or example code would be appreciated.
The Lucene in Action book mentions Tika as the library to use, but I am not
sure if this is the preferred