Hi,

On Thu, Oct 8, 2009 at 6:52 PM, Ken Krugler <kkrug...@transpac.com> wrote:
> I just ran into a problem where a truncated zip file is causing the
> ZipParser to hang.

Does it hang (i.e. never return), or throw a TikaException? The former
would be a clear bug, the latter expected behaviour given that the
file cannot be parsed.

> The file was truncated because I'd configured Bixo to only fetch the first
> 65K of a file, to avoid problems caused by huge files.
>
> This is common practice for web crawlers, but it means that I need to know
> which parsers can handle truncated content.

There's a somewhat related feature request, TIKA-261, that approaches
this issue from a slightly different angle.

> E.g. text is fine, HTML seems to be OK (based on my prior Nutch experience
> with NekoHTML).
>
> XML is not fine, from what I've seen - the parser will fail if it runs into
> the end of document before finishing the parse.

If the truncated stream ends with a -1 return from read(), then I
would expect the XML parser to throw a TikaException to signify a
parse failure. If the stream throws an IOException to signify
truncation, then the parser should propagate that exception up to the
caller.

The latter behavior suggests a way to cleanly implement the feature
you're asking for. The given input stream could be wrapped in a
decorator that throws a tagged IOException when the given size limit
has been reached. A parser can catch such exceptions and cleanly close
the emitted XHTML stream, potentially adding a metadata entry that
signifies that the extracted text has been truncated.

> And binary formats like zip, pdf, etc are definitely not OK with truncation.
>
> So it seems like I'd want to have a parser call that returns back info about
> whether the parser can handle truncated content - e.g.
>
> boolean truncatedOK(MimeType inputType);

A somewhat related issue is TIKA-153, which asks for a way to pass full
files or memory buffers to a parser. A truncatedOK() method would
essentially tell whether a parser will benefit from having such access
to the complete input document.

BR,

Jukka Zitting
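
PS. To make the decorator idea a bit more concrete, here is a rough
sketch using only java.io. None of these names exist in Tika:
TruncatedInputException, SizeLimitedInputStream and
TruncationDemo.consume() are all made up for illustration, and
consume() is just a stand-in for whatever a real parser would do.
The sketch also assumes that reaching the limit always counts as
truncation, even if the source happened to end exactly there.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical exception; the tag records which wrapper raised it, so a
// caller can tell "my size limit was hit" apart from unrelated I/O errors.
class TruncatedInputException extends IOException {
    private final Object tag;

    TruncatedInputException(Object tag, long limit) {
        super("Input truncated after " + limit + " bytes");
        this.tag = tag;
    }

    Object getTag() {
        return tag;
    }
}

// Hypothetical decorator: passes through at most 'limit' bytes and then
// throws instead of returning -1. mark/reset are not accounted for here.
class SizeLimitedInputStream extends FilterInputStream {
    private final long limit;
    private long count = 0;

    SizeLimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) {
            throw new TruncatedInputException(this, limit);
        }
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (count >= limit) {
            throw new TruncatedInputException(this, limit);
        }
        int n = super.read(buf, off, (int) Math.min(len, limit - count));
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Never skip past the limit; skipped bytes count as consumed.
        long skipped = super.skip(Math.min(n, limit - count));
        count += skipped;
        return skipped;
    }
}

class TruncationDemo {
    // Stand-in for a parser: copies bytes to 'out' as characters and
    // returns true if the wrapper around 'in' signalled truncation.
    static boolean consume(InputStream in, StringBuilder out) throws IOException {
        try {
            int b;
            while ((b = in.read()) != -1) {
                out.append((char) b);
            }
            return false;
        } catch (TruncatedInputException e) {
            if (e.getTag() != in) {
                throw e; // raised by some other wrapper, not ours
            }
            // A real parser would close its XHTML output cleanly here and
            // perhaps set a metadata entry noting the truncation.
            return true;
        }
    }
}
```

A crawler would wrap the fetched stream with its own fetch limit
before handing it to the parser, e.g. new SizeLimitedInputStream(raw,
65536), and check the returned flag (or the metadata entry) afterwards.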