Re: Problem with PDF to text conversion

Erik Scholtz, ArgonSoft GmbH Thu, 18 Feb 2010 00:40:13 -0800

Andreas,

you are right; catching the exception and not raising it to the caller is a problem. I suggest filing this as an issue in JIRA:


https://issues.apache.org/jira/browse/PDFBOX

Greetings,
Erik

[email protected] wrote:

Hi folks!

Sorry, this is my first posting on this mailing list and, well, errrr, I had
some interesting experiences with PDFBox today :-\


Well, I'm currently working on an Alfresco project, where Alfresco (a
document management system) employs Lucene as its full-text search engine and
PDFBox as the converter from PDF to plain text to feed Lucene.
Then we realized that some 5% of our PDF documents produced an
innocent-looking message in Alfresco's log file:

ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream

See http://forums.alfresco.com/en/viewtopic.php?f=8&t=24033&p=81641 for the 
full thread.

Some digging into the PDFBox source code turned up this piece of code:

FlateFilter:128 ff.

    try
    {
        // decoding not needed
        while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead, BUFFER_SIZE))) != -1)
        {
            result.write(buffer, 0, amountRead);
        }
    }
    catch (OutOfMemoryError exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (ZipException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (EOFException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }

which I consider really bad for two reasons:

- the failure to properly decode the PDF is hidden from the caller, so we never 
get a hint that the document was only partially decoded. As a result, we get an 
incomplete Lucene index!

- the OutOfMemoryError should NEVER EVER be caught and discarded this way, as
it might leave my application in an unstable state. When my application is out
of memory, I'm busted. And at the very least, I'd like to know when I'm busted ;-)

Conclusion: If an exception occurs, report it to the caller. Even better,
fix the decoder to properly read all PDFs in the universe, but I guess that is
the harder part :-)
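
To make "report it to the caller" concrete, here is a minimal sketch of a decode loop that lets the failure propagate. It is not the actual PDFBox FlateFilter: the class name FlateDecodeSketch, the decode method and the buffer size are hypothetical and chosen for illustration only. It relies only on the standard java.util.zip.InflaterInputStream and on the fact that ZipException and EOFException are both subclasses of IOException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not part of the PDFBox API.
public class FlateDecodeSketch {

    // Assumed buffer size; the real filter may use a different value.
    private static final int BUFFER_SIZE = 2048;

    /**
     * Inflates the compressed stream into result.
     * Any decoding failure is rethrown so the caller knows the output is incomplete.
     */
    public static void decode(InputStream compressed, OutputStream result) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        try (InflaterInputStream decompressor = new InflaterInputStream(compressed)) {
            int amountRead;
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1) {
                result.write(buffer, 0, amountRead);
            }
        } catch (IOException e) {
            // ZipException and EOFException both end up here:
            // add context and rethrow instead of only logging.
            throw new IOException("Corrupt Flate stream, decoded output is incomplete", e);
        }
        // OutOfMemoryError is deliberately not caught; it propagates unchanged.
    }
}
```

With something along these lines, a caller such as an indexer could catch the IOException and mark the document as only partially converted, instead of silently indexing a truncated text.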


Cheers
Andreas
