Andreas,
you are right; catching the exception and not raising it to the caller
is a problem. I would suggest filing this as an issue in JIRA:
https://issues.apache.org/jira/browse/PDFBOX
Greetings,
Erik
[email protected] wrote:
Hi folks!
Sorry, this is my first posting on this mailing list and, well, errrr, I had some
interesting experiences with PDFBox today :-\
I'm currently working on an Alfresco project, where Alfresco (a
document management system) employs Lucene as its full-text search engine and
PDFBox as the converter from PDF to plain text to feed Lucene.
Then we realized that some 5% of our PDF documents produced this
innocent-looking message in Alfresco's log file:
ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream
See http://forums.alfresco.com/en/viewtopic.php?f=8&t=24033&p=81641 for the
full thread.
Some digging into PDFBox's source code turned up this piece of code:
FlateFilter:128 ff.
    try
    {
        // decoding not needed
        while ((amountRead = decompressor.read(buffer, 0,
                Math.min(mayRead,BUFFER_SIZE))) != -1)
        {
            result.write(buffer, 0, amountRead);
        }
    }
    catch (OutOfMemoryError exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (ZipException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (EOFException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
which I consider really bad for two reasons:
- the failure to properly decode the PDF is hidden from the caller, so we never
get a hint that the document was only partially decoded. As a result, we get an
incomplete Lucene index!
- the OutOfMemoryError should NEVER EVER be caught and discarded this way, as
it might leave my application in an unstable state. When my application is out
of memory, I'm busted. And at the very least, I'd like to know when I'm busted ;-)
Conclusion: If an exception occurs, report it to the caller. And even better,
fix the decoder to properly read all PDFs in the universe, but I guess that is
the harder part :-)
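To make "report it to the caller" concrete, here is a minimal sketch of what I have in mind. It is not the actual FlateFilter code and not a proposed patch: the class FlateDecodeSketch and its decode/compress helpers are made up for illustration, and I use java.util.zip.InflaterInputStream directly as an assumption about how the stream is decompressed. The point is only that ZipException and EOFException (both are IOExceptions) reach the caller, and that OutOfMemoryError is not caught at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical stand-in for the decode loop; not the real PDFBox FlateFilter.
class FlateDecodeSketch {
    private static final int BUFFER_SIZE = 2048;

    // Decompresses 'compressed' into 'result'. Any decoding failure is
    // rethrown, so the caller knows the output is incomplete.
    static void decode(InputStream compressed, OutputStream result) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int amountRead;
        try (InflaterInputStream decompressor = new InflaterInputStream(compressed)) {
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1) {
                result.write(buffer, 0, amountRead);
            }
        } catch (IOException e) {
            // ZipException and EOFException are subclasses of IOException.
            throw new IOException("Flate decoding failed; output is incomplete", e);
        }
        // OutOfMemoryError is deliberately not caught here.
    }

    // Helper used only to produce test input for this sketch.
    static byte[] compress(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = compress("some page content".getBytes(StandardCharsets.UTF_8));
        // Simulate a corrupt stream by cutting the compressed data in half.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        ByteArrayOutputStream text = new ByteArrayOutputStream();
        decode(new ByteArrayInputStream(good), text);
        System.out.println("decoded " + text.size() + " bytes");

        try {
            decode(new ByteArrayInputStream(truncated), new ByteArrayOutputStream());
        } catch (IOException e) {
            // The caller (e.g. an indexer) can now skip or flag the document.
            System.out.println("caller sees the failure: " + e.getMessage());
        }
    }
}
```

With something along these lines, a caller like Alfresco could decide for itself whether to index the partial text, skip the document, or flag it for a human to look at.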
Cheers
Andreas