Hello Jukka,

thank you for the detailed explanation!
Now I see that the WriteLimitReachedException approach has problems with document types where the metadata is embedded in the content.
HTML files are the first example that comes to mind.
So if there are ever plans to support metadata-only parsing directly in the Tika API, I would vote +1 for it :-)

Kind regards
Sergiy Shyrkov


On 13.08.2010 10:12, Jukka Zitting wrote:
Hi,

On Fri, Aug 13, 2010 at 9:29 AM, Sergiy Shyrkov
<[email protected]>  wrote:
Half a year ago I also asked this question
(http://tika.markmail.org/message/c7lbr4zu62d6ulwl), but as I got no
answer, my solution was to use WriteOutContentHandler and set the
writeLimit (character limit) to 0.

Could you please advise whether using org.xml.sax.helpers.DefaultHandler
instead is a better solution?
Using the WriteOutContentHandler with writeLimit set to 0 might even
be a better solution for your case. A DefaultHandler will simply
ignore all extracted content, but the parser will still be parsing
through the entire document. The WriteLimitReachedException thrown by
a WriteOutContentHandler will terminate the parsing process as soon as
the write limit is reached.

The benefit is that for most document types this means that the parser
doesn't need to process the entire input document. The downside is
that not all document types have all the metadata available at the
beginning of the file, so terminating the parsing process early may
cost you some pieces of metadata.

BR,

Jukka Zitting
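
The trade-off described above can be illustrated with a small sketch. It
does not use Tika itself: LimitingHandler and LimitReached are hypothetical
stand-ins for WriteOutContentHandler and WriteLimitReachedException, and the
JDK's own SAX parser plays the role of the Tika parser. The assumption is
only that the handler aborts parsing by throwing from characters() once the
character limit would be exceeded.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EarlyAbortSketch {

    /** Hypothetical marker exception (stand-in for WriteLimitReachedException). */
    public static class LimitReached extends SAXException {
        public LimitReached() {
            super("character limit reached");
        }
    }

    /** Hypothetical handler that aborts once more than `limit` characters arrive. */
    public static class LimitingHandler extends DefaultHandler {
        private final int limit;
        private int charsSeen = 0;
        private int elementsSeen = 0;

        public LimitingHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elementsSeen++; // counts how far into the document the parser got
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (charsSeen + length > limit) {
                throw new LimitReached(); // terminates the parse immediately
            }
            charsSeen += length;
        }

        public int charsSeen() {
            return charsSeen;
        }

        public int elementsSeen() {
            return elementsSeen;
        }
    }

    /** Parses the XML and returns true if the handler aborted the parse early. */
    public static boolean parseAborted(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return false;
        } catch (SAXException e) {
            // Walk the cause chain in case the parser wrapped our exception.
            for (Throwable t = e; t != null; t = t.getCause()) {
                if (t instanceof LimitReached) {
                    return true;
                }
            }
            throw new IllegalStateException(e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For an input such as
<doc><title>T</title><body>hello</body><meta name='author'/></doc>,
a handler with limit 0 aborts at the first text node, so the trailing meta
element is never reached. A handler with a very large limit visits all four
elements but has to read the whole document to do so.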
