Hi,

On Thu, Aug 12, 2010 at 3:30 PM, Sergiy Karpenko
<[email protected]> wrote:
> As far as I know, Parser extracts both metadata and file content.
>
> parse(stream, contentHandler, metadata, parseContext);
>
> But full content extraction is redundant if I want to extract only metadata.
>
> How can I do this with Tika?

Just pass a "new org.xml.sax.helpers.DefaultHandler()" as the content
handler argument to the parse() method. Tika will still do content
extraction along with metadata extraction, but the text content is
simply ignored.
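
To illustrate why this works: DefaultHandler implements every SAX
callback as a no-op, so whatever events a parser sends to it are
dropped. The sketch below uses only the JDK's own SAX parser as a
stand-in for a Tika parser, with a made-up XHTML snippet and a
hypothetical TextCollector class for contrast. It is not Tika code,
just a demonstration of the handler's behaviour.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class DiscardingHandlerDemo {

    // Made-up sample standing in for the XHTML events a parser would emit.
    static final String SAMPLE =
        "<html><head><title>Report</title></head>"
        + "<body><p>Hello world</p></body></html>";

    // Hypothetical handler that keeps the text, for contrast with DefaultHandler.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Streams SAMPLE through the given handler using the JDK SAX parser.
    static void feed(DefaultHandler handler) throws Exception {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance()
            .newSAXParser()
            .parse(new ByteArrayInputStream(bytes), handler);
    }

    public static void main(String[] args) throws Exception {
        // Plain DefaultHandler: every event is accepted and discarded.
        feed(new DefaultHandler());

        // Collecting handler: the same events, but the text is retained.
        TextCollector collector = new TextCollector();
        feed(collector);
        System.out.println(collector.text);
    }
}
```

With Tika the idea is the same: the handler receives the events and
throws them away, while the Metadata object you passed to parse() is
still populated.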

There was some earlier discussion about adding a specific dummy
ContentHandler instance (or allowing a null handler) that would tell
the underlying parsers that the client is only interested in the
document metadata. So far the need for such an optimization has not
been pressing, so we haven't implemented it yet.

BR,

Jukka Zitting
