Hi folks! Sorry, this is my first posting on this mailing list and, well, errrr, I had some interesting experiences with PDFBox today :-\
Well, I'm currently working on an Alfresco project. Alfresco (a document management system) uses Lucene as its full-text search engine and PDFBox to convert PDFs to plain text for Lucene. We then noticed that some 5% of our PDF documents produced an innocent-looking message in Alfresco's log file:

    ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream

See http://forums.alfresco.com/en/viewtopic.php?f=8&t=24033&p=81641 for the full thread.

Some digging into the PDFBox source code turned up this piece of code (FlateFilter:128 ff.):

    try
    {
        // decoding not needed
        while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead, BUFFER_SIZE))) != -1)
        {
            result.write(buffer, 0, amountRead);
        }
    }
    catch (OutOfMemoryError exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (ZipException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (EOFException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }

I consider this really bad for two reasons:

- The failure to properly decode the PDF is hidden from the caller, so we never get a hint that the document was only partially decoded. As a result, we end up with an incomplete Lucene index!

- An OutOfMemoryError should NEVER EVER be caught and discarded this way, as it may leave my application in an unstable state. When my application is out of memory, I'm busted. And at the very least, I'd like to know when I'm busted ;-)

Conclusion: if an exception occurs, report it to the caller. Even better, fix the decoder to properly read all PDFs in the universe, but I guess that is the harder part :-)

Cheers
Andreas
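Just to make the "report it to the caller" point concrete, here is a minimal sketch of how such a decode method could be written with plain java.util.zip. This is not PDFBox's actual code; the class and method names are made up for illustration. The idea is simply to let the corrupt-stream exceptions surface to the caller and to leave OutOfMemoryError alone:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

public class FlateDecodeSketch
{
    // Hypothetical decoder: decode failures propagate as IOException
    // instead of being logged and swallowed. OutOfMemoryError is
    // deliberately NOT caught, so the JVM's error handling still applies.
    static byte[] decode(byte[] compressed) throws IOException
    {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try (InflaterInputStream decompressor =
                 new InflaterInputStream(new ByteArrayInputStream(compressed)))
        {
            int amountRead;
            while ((amountRead = decompressor.read(buffer)) != -1)
            {
                result.write(buffer, 0, amountRead);
            }
        }
        catch (ZipException | EOFException exception)
        {
            // Rethrow with context so the caller (e.g. an indexer) can
            // decide whether to skip, retry, or flag the document.
            throw new IOException("Corrupt Flate stream", exception);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException
    {
        // Round-trip a valid stream...
        byte[] data = "hello pdfbox".getBytes("US-ASCII");
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(compressed))
        {
            out.write(data);
        }
        System.out.println(new String(decode(compressed.toByteArray()), "US-ASCII"));

        // ...and confirm that a corrupt stream now surfaces as an exception
        // instead of silently yielding a truncated result.
        try
        {
            decode(new byte[] { 0x00, 0x01, 0x02, 0x03 });
            System.out.println("no exception");
        }
        catch (IOException expected)
        {
            System.out.println("caught IOException");
        }
    }
}
```

With that, an indexer built on top of it at least knows the text it got back is incomplete and can mark the document accordingly, instead of feeding a partial result to Lucene.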

