Hello,

Half a year ago I also asked this question (http://tika.markmail.org/message/c7lbr4zu62d6ulwl), but as I got no answer, my solution was to use WriteOutContentHandler and set the writeLimit (character limit) to 0.

Could you please advise whether using org.xml.sax.helpers.DefaultHandler instead is a better solution? I am mostly worried about performance and memory footprint, as metadata extraction is done synchronously on our side, whereas complete
text extraction runs in a background job.

Thank you in advance!

Kind regards
Sergiy Shyrkov



On 12.08.2010 15:43, Jukka Zitting wrote:
Hi,

On Thu, Aug 12, 2010 at 3:30 PM, Sergiy Karpenko
<[email protected]>  wrote:
As far as I know, the Parser extracts both metadata and file content.

parse(stream, contentHandler, metadata, parseContext);

But full content extraction is redundant if I want to extract only metadata.

How can I do this with Tika?
Just pass a "new org.xml.sax.helpers.DefaultHandler()" as the content
handler argument to the parse() method. Tika will still do content
extraction along with metadata extraction, but the text content is
simply ignored.
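
To illustrate why this costs no extra memory, here is a minimal sketch using only JDK classes. The `parse` method below is a hypothetical stand-in for a Tika parser (it is not Tika's API); it only mimics the pattern of filling a metadata object while streaming text to the content handler as SAX events. The class and helper names are likewise made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataOnlySketch {

    // Hypothetical stand-in for a Tika parser (an assumption, not Tika code):
    // it records metadata and streams the body text to the handler.
    // With Tika itself the call from the original question would read:
    //   parse(stream, new DefaultHandler(), metadata, parseContext);
    static void parse(String title, String body, ContentHandler handler,
                      Map<String, String> metadata) throws SAXException {
        metadata.put("title", title);
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = body.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
    }

    // Handler that keeps the text, shown only for comparison.
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Metadata only: DefaultHandler implements every callback as a no-op,
    // so the streamed text is dropped and nothing is buffered.
    static Map<String, String> metadataOnly(String title, String body) {
        Map<String, String> metadata = new LinkedHashMap<>();
        try {
            parse(title, body, new DefaultHandler(), metadata);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return metadata;
    }

    // Full text: the same parse call, but the handler accumulates the characters.
    static String fullText(String title, String body) {
        CollectingHandler handler = new CollectingHandler();
        try {
            parse(title, body, handler, new LinkedHashMap<>());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(metadataOnly("Report", "A long document body"));
        System.out.println(fullText("Report", "A long document body"));
    }
}
```

The parsing work is the same in both helpers; only the handler decides whether the text is kept. That matches the point above: the extraction still runs, but a plain DefaultHandler holds on to none of its output.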

There was earlier some discussion about adding some specific dummy
ContentHandler instance (or allowing a null handler) that would inform
the underlying parsers that the client is only interested in the
document metadata. So far the need for such an optimization has not
been too pressing, so we haven't yet implemented that.

BR,

Jukka Zitting
