Hi,

On Fri, Aug 13, 2010 at 9:29 AM, Sergiy Shyrkov
<[email protected]> wrote:
> half a year ago I have also asked this question
> (http://tika.markmail.org/message/c7lbr4zu62d6ulwl ), but as I got no
> answer, my solution was to use WriteOutContentHandler and set the
> writeLimit (character limit) to 0.
>
> Could you, please, advice if using org.xml.sax.helpers.DefaultHandler
> instead is a better solution?

Using the WriteOutContentHandler with writeLimit set to 0 may actually
be the better solution for your case. A DefaultHandler simply ignores
all extracted content, but the parser still parses the entire document.
The WriteLimitReachedException thrown by a WriteOutContentHandler
terminates the parsing process as soon as the write limit is reached.

The benefit is that for most document types the parser doesn't need to
process the entire input document. The downside is that not all
document types have all the metadata available at the beginning of the
file, so terminating the parsing process early may cost you some pieces
of metadata.
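
To make the trade-off concrete, here is a rough sketch that uses only
the JDK's own SAX parser, not Tika. The names EarlyStopDemo,
LimitReached, CountingHandler and elementsVisited are made up for this
illustration; LimitReached is just a stand-in for the exception a
WriteOutContentHandler throws, and the real Tika classes may well
behave differently in detail. A negative limit acts like a plain
DefaultHandler (never abort), while a limit of 0 aborts on the first
character data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EarlyStopDemo {

    // Sample input: the <tail> element plays the role of metadata
    // that only shows up late in the document.
    static final String SAMPLE =
        "<doc><meta>title</meta><body>text</body><tail>late</tail></doc>";

    // Hypothetical stand-in for the exception that signals the limit.
    static class LimitReached extends SAXException {
        LimitReached() { super("write limit reached"); }
    }

    // Counts elements and aborts once more than `limit` characters
    // have arrived. A negative limit means "never abort".
    static class CountingHandler extends DefaultHandler {
        final int limit;
        int elementsSeen = 0;
        int charsSeen = 0;
        boolean aborted = false;

        CountingHandler(int limit) { this.limit = limit; }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elementsSeen++;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            charsSeen += length;
            if (limit >= 0 && charsSeen > limit) {
                aborted = true;
                throw new LimitReached(); // stops the parser right here
            }
        }
    }

    // Returns how many elements the parser reached before it finished
    // or was aborted by the handler.
    static int elementsVisited(String xml, int limit) {
        CountingHandler handler = new CountingHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        } catch (SAXException e) {
            // Only swallow the exception if we caused it ourselves.
            if (!handler.aborted) {
                throw new RuntimeException(e);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.elementsSeen;
    }

    public static void main(String[] args) {
        System.out.println("no limit: " + elementsVisited(SAMPLE, -1));
        System.out.println("limit 0:  " + elementsVisited(SAMPLE, 0));
    }
}
```

With the limit set to 0 the parser never gets as far as the <tail>
element, which is exactly the kind of late metadata you would lose by
stopping early.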

BR,

Jukka Zitting
