Hi all,
I just ran into a problem where a truncated zip file is causing the
ZipParser to hang.
The file was truncated because I'd configured Bixo to only fetch the
first 65K of a file, to avoid problems caused by huge files.
This is common practice for web crawlers, but it means that I need to
know which parsers can handle truncated content.
E.g. text is fine, HTML seems to be OK (based on my prior Nutch
experience with NekoHTML).
XML is not fine, from what I've seen - the parser will fail if it runs
into the end of document before finishing the parse.
And binary formats like zip, PDF, etc. are definitely not OK with
truncation.
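To illustrate the XML case, here is a minimal sketch using the JDK's built-in DOM parser rather than any Tika parser. The class and method names (TruncatedXmlDemo, parses) are made up for this example. It only shows that a standard XML parser rejects a document cut off before its closing tags, which is the failure mode described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Tika or Bixo.
class TruncatedXmlDemo {

    // Returns true if the JDK's DOM parser accepts the given XML text,
    // false if it rejects it as malformed.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Malformed input, including a document that ends early.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not expected for in-memory input with default settings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String complete = "<root><item>one</item><item>two</item></root>";
        // Simulate a fetch limit by cutting the document short.
        String truncated = complete.substring(0, 30);

        System.out.println("complete parses:  " + parses(complete));
        System.out.println("truncated parses: " + parses(truncated));
    }
}
```

Note that the default error handler may also print a "[Fatal Error]" line to stderr for the truncated input.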
So it seems like I'd want a parser call that reports whether the
parser can handle truncated content - e.g.
boolean truncatedOK(MimeType inputType);
As a stop-gap, I could assume that non-XML text was OK, and everything
else was no-go for truncated content.
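A rough sketch of that stop-gap might look like the following. The assumptions: the class name TruncationPolicy is hypothetical, it takes the MIME type as a plain String rather than a MimeType object so the example stands alone, and "XML" is detected by the text/xml type or a "+xml" suffix. It treats text/* other than XML as safe and everything else as unsafe.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika or Bixo.
class TruncationPolicy {

    // Stop-gap rule: non-XML text is assumed to survive truncation,
    // everything else (XML, binary formats, unknown types) is not.
    static boolean truncatedOK(String mimeType) {
        if (mimeType == null) {
            return false; // unknown type: be conservative
        }

        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String type = mimeType.toLowerCase(Locale.ROOT);
        int semi = type.indexOf(';');
        if (semi >= 0) {
            type = type.substring(0, semi);
        }
        type = type.trim();

        if (!type.startsWith("text/")) {
            return false; // zip, pdf, and other non-text types
        }

        // XML served under a text type still needs the whole document.
        return !(type.equals("text/xml") || type.endsWith("+xml"));
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/plain", "text/html; charset=UTF-8", "text/xml",
            "application/zip", "application/pdf"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + truncatedOK(s));
        }
    }
}
```

A crawler could call this before handing truncated content to a parser, and skip the parse when it returns false.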
Thoughts on this?
Thanks,
-- Ken
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378