Hi all,

I just ran into a problem where a truncated zip file is causing the ZipParser to hang.

The file was truncated because I'd configured Bixo to only fetch the first 65K of a file, to avoid problems caused by huge files.

This is common practice for web crawlers, but it means that I need to know which parsers can handle truncated content.

E.g. text is fine, HTML seems to be OK (based on my prior Nutch experience with NekoHTML).

XML is not fine, from what I've seen: the parser will fail if it hits the end of the document before finishing the parse.
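
To illustrate the XML case, here is a minimal sketch using the JDK's built-in DOM parser rather than any Tika parser. The class name TruncatedXmlCheck and the helper parses() are made up for this example; the point is only that a document cut off mid-element is rejected with a SAXException instead of yielding partial content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika or Bixo.
final class TruncatedXmlCheck {
    private TruncatedXmlCheck() {}

    // Returns true if the JDK DOM parser accepts the input, false if it
    // rejects it as malformed (which is what happens on truncation).
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Malformed input, including a document that ends too early.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<doc><item>text</item></doc>")); // complete document
        System.out.println(parses("<doc><item>text"));              // cut off mid-element
    }
}
```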

And binary formats like zip, pdf, etc. are definitely not OK with truncation.

So it seems like I'd want a parser call that reports whether the parser can handle truncated content, e.g.

boolean truncatedOK(MimeType inputType);

As a stop-gap, I could assume that non-XML text was OK, and everything else was no-go for truncated content.
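
A rough sketch of that stop-gap, just to make the rule concrete. This is not an existing Tika or Bixo API: the class name TruncationPolicy is made up, and I'm assuming the type is passed as a plain string rather than a MimeType object so the example stands alone. Treating "text/xml" and any "+xml" suffix as XML is also my assumption about what counts as XML here.

```java
import java.util.Locale;

// Hypothetical stop-gap policy: non-XML text types are assumed to survive
// truncation, everything else is assumed not to.
final class TruncationPolicy {
    private TruncationPolicy() {}

    static boolean truncatedOK(String mimeType) {
        if (mimeType == null) {
            return false; // unknown type: be conservative
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String type = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!type.startsWith("text/")) {
            return false; // zip, pdf, and other non-text types
        }
        // XML served under a text/* type still needs the whole document.
        if (type.equals("text/xml") || type.endsWith("+xml")) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/plain", "text/html; charset=utf-8", "text/xml",
            "application/zip", "application/pdf"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + truncatedOK(sample));
        }
    }
}
```

A crawler could call this before handing truncated content to a parser and skip the parse when it returns false.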

Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
