Hello,

I am using Tika not only to extract text content from files
(embedded or not); I also need to extract the embedded files to a
separate location on the file system, so that they are accessible as
top-level files.

Because the parsed files can be very big, I had to do this file
extraction in a streaming fashion. Within my custom recursiveParser,
which is a ParserDecorator based on the idea from the Tika wiki, I
create for each embedded file my own ByteCollectionInputStreamWrapper.
This is an InputStream decorator that writes the embedded file's bytes
to a disk location while the Tika parsers are reading them. Everything
works fine and very efficiently.
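
For illustration, a wrapper of this kind can be sketched roughly as
follows. This is a simplified stand-in, not the actual
ByteCollectionInputStreamWrapper: the class name CopyingInputStream and
the choice of a plain OutputStream as the sink are assumptions made for
the example, and it uses only java.io.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical stand-in for the wrapper described above: every byte the
// consumer (for example a Tika parser) reads is also written to a sink,
// such as a FileOutputStream pointing at the extraction location.
class CopyingInputStream extends FilterInputStream {
    private final OutputStream sink;

    CopyingInputStream(InputStream in, OutputStream sink) {
        super(in);
        this.sink = sink;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            sink.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            sink.write(buf, off, n);
        }
        return n;
    }

    // Skipped bytes must still reach the sink, so skip by reading.
    @Override
    public long skip(long n) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        while (total < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - total));
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    // mark/reset would put the same bytes into the copy twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            sink.close();
        }
    }
}
```

The copy is complete only if the consumer reads the stream to the end,
which is exactly what stops being true in the error case below.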

BUT the problem is when a parsing error occurs, for example when Tika
tries to parse a password-protected archive. The whole
org.apache.tika.parser.Parser#parse method fails by throwing an
exception, so reading from the file's InputStream is interrupted and
the bytes after that point are never read. As a result, the copy at
the separate file location is left incomplete.
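
The effect can be reproduced without Tika. The sketch below is only an
illustration of the symptom: failingParse is a hypothetical stand-in
for a parser that gives up part-way through, and it throws IOException
purely to keep the example within java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class TruncatedCopyDemo {
    // Hypothetical stand-in for a parser: reads a 4-byte "header",
    // then fails, as a parser might on a password-protected archive.
    static void failingParse(InputStream in) throws IOException {
        byte[] header = new byte[4];
        in.read(header);
        throw new IOException("simulated parse failure");
    }

    // Returns how many bytes were left unread in the stream when the
    // parse failed; a copy-on-read wrapper never sees these bytes.
    static int unreadAfterFailure(byte[] data) {
        InputStream in = new ByteArrayInputStream(data);
        try {
            failingParse(in);
        } catch (IOException e) {
            // The parse is abandoned here, with the stream only
            // partially consumed.
        }
        return data.length - 4 > 0 ? availableQuietly(in) : 0;
    }

    private static int availableQuietly(InputStream in) {
        try {
            return in.available();
        } catch (IOException e) {
            return 0;
        }
    }
}
```

Any bytes still unread at the moment of the exception are the ones
missing from my extracted file.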

Does anyone have an idea how to collect the bytes in such error situations?

Regards,
Vjeran
