Re: How to extract only metadata

Sergiy Shyrkov Fri, 13 Aug 2010 00:30:18 -0700

 Hello,

Half a year ago I asked the same question (http://tika.markmail.org/message/c7lbr4zu62d6ulwl), but as I got no answer, my solution was to use WriteOutContentHandler and set the writeLimit (character limit) to 0.

Could you please advise whether using org.xml.sax.helpers.DefaultHandler instead is a better solution? I am mostly worried about performance and memory footprint, as metadata extraction is done synchronously on our side, whereas complete text extraction runs in a background job.

Thank you in advance!

Kind regards
Sergiy Shyrkov



On 12.08.2010 15:43, Jukka Zitting wrote:

Hi,

On Thu, Aug 12, 2010 at 3:30 PM, Sergiy Karpenko
<[email protected]>  wrote:

As far as I know, Parser extracts both metadata and file content.

parse(stream, contentHandler, metadata, parseContext);

But full content extraction is redundant if I want to extract only metadata.

How can I do this with Tika?

Just pass a "new org.xml.sax.helpers.DefaultHandler()" as the content
handler argument to the parse() method. Tika will still do content
extraction along with metadata extraction, but the text content is
simply ignored.
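
The effect of passing a DefaultHandler can be illustrated without Tika on the classpath. The following is a minimal sketch, assuming a plain JDK SAX parser as a stand-in for Tika's parse() method; the class name IgnoreContentDemo and its helper methods are hypothetical and not part of Tika. It shows that the parser still emits content events, and that a DefaultHandler discards them instead of buffering any text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a JDK SAX parser stands in for a Tika Parser here.
public class IgnoreContentDemo {

    /** Counts character data to show that content events are still fired. */
    static class CountingHandler extends DefaultHandler {
        int chars = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            chars += length;
        }
    }

    /** Parses a tiny XML document and sends all SAX events to the given handler. */
    public static void parseWith(DefaultHandler handler) {
        byte[] xml = "<doc><p>hello</p></doc>".getBytes(StandardCharsets.UTF_8);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Returns how many characters of text content the parser reported. */
    public static int countedChars() {
        CountingHandler handler = new CountingHandler();
        parseWith(handler);
        return handler.chars;
    }

    public static void main(String[] args) {
        // A plain DefaultHandler: every callback is a no-op, so nothing is stored.
        parseWith(new DefaultHandler());

        // The same parse with a counting handler shows the events were emitted.
        System.out.println("characters reported: " + countedChars());
    }
}
```

With Tika, the corresponding call would be the one from the question, parse(stream, new DefaultHandler(), metadata, parseContext), with the metadata object then read after the call returns.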

There was earlier some discussion about adding some specific dummy
ContentHandler instance (or allowing a null handler) that would inform
the underlying parsers that the client is only interested in the
document metadata. So far the need for such an optimization has not
been too pressing, so we haven't yet implemented that.

BR,

Jukka Zitting
