Hi,

On Fri, Aug 13, 2010 at 9:29 AM, Sergiy Shyrkov <[email protected]> wrote:
> half a year ago I have also asked this question
> (http://tika.markmail.org/message/c7lbr4zu62d6ulwl ), but as I got no
> answer, my solution was to use WriteOutContentHandler and set the
> writeLimit (character limit) to 0.
>
> Could you, please, advice if using org.xml.sax.helpers.DefaultHandler
> instead is a better solution?

Using the WriteOutContentHandler with writeLimit set to 0 might even
be a better solution for your case.

A DefaultHandler will simply ignore all extracted content, but the
parser will still be parsing through the entire document. The
WriteLimitReachedException thrown by a WriteOutContentHandler will
terminate the parsing process as soon as the write limit is reached.

The benefit is that for most document types this means that the parser
doesn't need to process the entire input document. The downside is
that not all document types have all the metadata available at the
beginning of the file, so terminating the parsing process early may
cost you some pieces of metadata.

BR,

Jukka Zitting
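
To illustrate the difference described above, here is a minimal sketch
that uses only the SAX classes shipped with the JDK, not Tika itself.
The names EarlyAbortDemo, CountingHandler and LimitReachedException are
hypothetical stand-ins invented for this example; they only imitate the
behaviour of a handler that throws once a character limit is exceeded
and are not how WriteOutContentHandler is implemented. A negative limit
is used here (as an assumption of the sketch) to mean "never abort",
which behaves like a plain DefaultHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EarlyAbortDemo {

    /** Hypothetical marker exception; a stand-in for a write-limit signal. */
    static class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("character limit reached");
        }
    }

    /**
     * Counts the elements the parser reports. If limit >= 0, it throws as
     * soon as more than `limit` characters of text have been delivered.
     */
    static class CountingHandler extends DefaultHandler {
        private final int limit;
        private int charsSeen = 0;
        int elements = 0;
        boolean aborted = false;

        CountingHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            charsSeen += length;
            if (limit >= 0 && charsSeen > limit) {
                aborted = true;
                // Throwing from the handler makes the parser stop reading.
                throw new LimitReachedException();
            }
        }
    }

    /** Parses the XML and returns how many elements were visited. */
    static int elementsVisited(String xml, int limit) throws Exception {
        CountingHandler handler = new CountingHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Swallow only our own abort; real parse errors still propagate.
            if (!handler.aborted) {
                throw e;
            }
        }
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>one</p><p>two</p><p>three</p></doc>";
        System.out.println("no limit: " + elementsVisited(xml, -1));
        System.out.println("limit 0:  " + elementsVisited(xml, 0));
    }
}
```

With no limit the handler sees every element even though it keeps none
of the text. With a limit of 0 the first piece of text triggers the
exception and the rest of the input is never visited, which is also why
anything that appears late in the input (such as trailing metadata)
would be missed.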
