Info from parser on handling partial input

Hi all,

I just ran into a problem where a truncated zip file is causing the ZipParser to hang.

The file was truncated because I'd configured Bixo to only fetch the first 65K of a file, to avoid problems caused by huge files.

This is common practice for web crawlers, but it means that I need to know which parsers can handle truncated content.

E.g. text is fine, HTML seems to be OK (based on my prior Nutch experience with NekoHTML).

XML is not fine, from what I've seen - the parser will fail if it runs into the end of document before finishing the parse.

And binary formats like zip, pdf, etc. are definitely not OK with truncation.

So it seems like I'd want to have a parser call that returns info about whether the parser can handle truncated content - e.g.


boolean truncatedOK(MimeType inputType);

As a stop-gap, I could assume that non-XML text was OK, and everything else was no-go for truncated content.
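
To make the stop-gap concrete, here's a rough sketch of what I have in mind. Nothing here is an existing Tika API: the class name TruncationPolicy is made up, and I'm taking the type as a plain String rather than a MimeType just to keep the example self-contained. It also assumes that anything named text/xml, application/xml or ending in +xml should be treated as XML.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of Tika): a stop-gap guess at whether
 * content of a given type can be parsed safely after truncation.
 */
final class TruncationPolicy {

    private TruncationPolicy() {
    }

    /** Returns true only for non-XML text types; everything else is a no-go. */
    static boolean truncatedOK(String mimeType) {
        if (mimeType == null) {
            return false;
        }

        // Drop parameters such as "; charset=utf-8" and normalize case.
        String type = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);

        // Assumption: these names cover the XML variants, which fail on a
        // premature end of document.
        if (type.equals("text/xml")
                || type.equals("application/xml")
                || type.endsWith("+xml")) {
            return false;
        }

        // Plain text, HTML, etc. Binary formats (zip, pdf, ...) fall through to false.
        return type.startsWith("text/");
    }
}
```

The crawler would then skip parsing (or flag the document) whenever the fetch hit the size limit and truncatedOK returns false for the content type.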


Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
