Hi,

On Thu, Oct 8, 2009 at 7:34 PM, Hanssens Bart <bart.hanss...@fedict.be> wrote:
> Some zips might be OK: if one manages to get at least one zipentry
> before hitting the 64 K limit (say xml-in-zip formats like ODF, OOXML,
> ePUB), it should be possible to index it partially.

It's possible to read ZIP files in streaming mode, but see the caveats
listed in [1]. The current ZipParser in Tika uses streaming mode even
though the result may be incorrect. Once TIKA-153 is solved, we should be
able to automatically switch to more correct parsing when the full input
document is available in random-access mode.

[1] http://commons.apache.org/compress/zip.html

BR,

Jukka Zitting
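
PS. To illustrate the partial-indexing idea, here is a rough sketch of
streaming over a possibly truncated ZIP with the JDK's own
java.util.zip.ZipInputStream. This is not the ZipParser code in Tika; the
class and method names are made up for the example, and it only collects the
names of the entries that could be read in full before the input ran out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only, not part of Tika.
final class StreamingZipSketch {

    /**
     * Reads entries sequentially from the local file headers and returns
     * the names of those whose data could be read completely. A truncated
     * stream ends the loop early instead of failing the whole document.
     */
    static List<String> readableEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Drain the entry; a real parser would hand these bytes
                // to the parser for the embedded document instead.
                while (zip.read(buffer) != -1) {
                    // discard
                }
                names.add(entry.getName());
            }
        } catch (IOException e) {
            // Input ended (or was corrupt) in the middle of an entry:
            // keep whatever was read successfully so far.
        }
        return names;
    }
}
```

Note that this never looks at the central directory at the end of the
archive, which is why the streaming result can differ from what a
random-access reader would report.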