Hi Jukka,
On Thu, Oct 8, 2009 at 6:52 PM, Ken Krugler <kkrug...@transpac.com> wrote:
>> I just ran into a problem where a truncated zip file is causing the
>> ZipParser to hang.
> Does it hang (i.e. never return), or throw a TikaException? The former
> would be a clear bug, the latter expected behaviour given that the
> file cannot be parsed.
It hung.
>> The file was truncated because I'd configured Bixo to only fetch the
>> first 65K of a file, to avoid problems caused by huge files.
>> This is common practice for web crawlers, but it means that I need to
>> know which parsers can handle truncated content.
> There's a somewhat related feature request, TIKA-261, that approaches
> this issue from a slightly different angle.
>> E.g. text is fine, and HTML seems to be OK (based on my prior Nutch
>> experience with NekoHTML).
>> XML is not fine, from what I've seen - the parser will fail if it runs
>> into the end of the document before finishing the parse.
> If the truncated stream ends with a -1 return from read(), then I
> would expect the XML parser to throw a TikaException to signify a
> parse failure. If the stream throws an IOException to signify
> truncation, then the parser should propagate that exception up to the
> caller.
>
> The latter behavior suggests a way to cleanly implement the feature
> you're asking for. The given input stream could be wrapped in a
> decorator that throws a tagged IOException when the given size limit
> has been reached. A parser can catch such exceptions and cleanly
> close the emitted XHTML stream, potentially adding a metadata entry
> that signifies that the extracted text has been truncated.
Interesting idea. I'll need to capture a bunch of truncated zip files
to test.
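For what it's worth, here's a rough sketch of the decorator idea as I understand it. The class names (LimitedInputStream, TruncatedStreamException) are made up for illustration - they're not existing Tika APIs - but something like this would let a parser distinguish "size limit hit" from other I/O failures:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical tagged exception signalling that the size limit was reached. */
class TruncatedStreamException extends IOException {
    TruncatedStreamException(String message) {
        super(message);
    }
}

/**
 * Hypothetical decorator that throws a TruncatedStreamException
 * once the configured number of bytes has been read.
 */
class LimitedInputStream extends FilterInputStream {

    private final long limit;
    private long count = 0;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) {
            throw new TruncatedStreamException("Size limit " + limit + " reached");
        }
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (count >= limit) {
            throw new TruncatedStreamException("Size limit " + limit + " reached");
        }
        // Never read past the limit; the next call will then throw.
        int n = super.read(b, off, (int) Math.min(len, limit - count));
        if (n != -1) {
            count += n;
        }
        return n;
    }
}
```

A parser could then catch TruncatedStreamException, close the XHTML output cleanly, and set a "truncated" metadata entry, while letting any other IOException propagate as a genuine failure.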
I filed https://issues.apache.org/jira/browse/TIKA-307 to capture this.
-- Ken