Hi folks! Sorry, this is my first posting on this mailing list and, well, errrr, I had some interesting experiences with PDFBox today :-\
Well, I'm currently working on an Alfresco project. Alfresco (a document management system) uses Lucene as its full-text search engine and PDFBox to convert PDFs to plain text for Lucene. We then noticed that some 5% of our PDF documents produced an innocent-looking message in Alfresco's log file:

    ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream

See http://forums.alfresco.com/en/viewtopic.php?f=8&t=24033&p=81641 for the full thread.

Some digging into the PDFBox source code turned up this piece of code (FlateFilter:128 ff.):

    try
    {
        // decoding not needed
        while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead, BUFFER_SIZE))) != -1)
        {
            result.write(buffer, 0, amountRead);
        }
    }
    catch (OutOfMemoryError exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (ZipException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (EOFException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }

I consider this really bad for two reasons:

- The failure to properly decode the PDF is hidden from the caller, so we never get a hint that the document was only partially decoded. As a result, we end up with an incomplete Lucene index!

- An OutOfMemoryError should NEVER EVER be caught and discarded this way, as it may leave my application in an unstable state. When my application is out of memory, I'm busted. And at the very least, I'd like to know when I'm busted ;-)

Conclusion: if an exception occurs, report it to the caller. Even better, fix the decoder to properly read all PDFs in the universe, but I guess that is the harder part :-)

Cheers
Andreas
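Just to make the "report it to the caller" point concrete, here is a minimal sketch of how such a decode method could be written with plain java.util.zip. This is not PDFBox's actual code; the class and method names are made up for illustration. The idea is simply to let the corrupt-stream exceptions surface to the caller and to leave OutOfMemoryError alone:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

public class FlateDecodeSketch
{
    // Hypothetical decoder: decode failures propagate as IOException
    // instead of being logged and swallowed. OutOfMemoryError is
    // deliberately NOT caught, so the JVM's error handling still applies.
    static byte[] decode(byte[] compressed) throws IOException
    {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try (InflaterInputStream decompressor =
                 new InflaterInputStream(new ByteArrayInputStream(compressed)))
        {
            int amountRead;
            while ((amountRead = decompressor.read(buffer)) != -1)
            {
                result.write(buffer, 0, amountRead);
            }
        }
        catch (ZipException | EOFException exception)
        {
            // Rethrow with context so the caller (e.g. an indexer) can
            // decide whether to skip, retry, or flag the document.
            throw new IOException("Corrupt Flate stream", exception);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException
    {
        // Round-trip a valid stream...
        byte[] data = "hello pdfbox".getBytes("US-ASCII");
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(compressed))
        {
            out.write(data);
        }
        System.out.println(new String(decode(compressed.toByteArray()), "US-ASCII"));

        // ...and confirm that a corrupt stream now surfaces as an exception
        // instead of silently yielding a truncated result.
        try
        {
            decode(new byte[] { 0x00, 0x01, 0x02, 0x03 });
            System.out.println("no exception");
        }
        catch (IOException expected)
        {
            System.out.println("caught IOException");
        }
    }
}
```

With that, an indexer built on top of it at least knows the text it got back is incomplete and can mark the document accordingly, instead of feeding a partial result to Lucene.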

