Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream

Andreas Lehmkuehler Mon, 19 Mar 2012 00:12:55 -0700

Hi,

On 18.03.2012 09:05, Cool The Breezer wrote:

- one of your PDFs may be corrupt, try to find out if the exception occurs when 
processing the very same document

I can parse the same PDF file without any issue, but in a multi-threaded 
environment, after parsing about 200 files, I keep getting this exception and 
none of the remaining files is parsed successfully. I then had to forcibly stop 
the parser.

Hmmm, PDFBox isn't supposed to be threadsafe, so that could be the problem.
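
A minimal sketch of one way to keep parser state out of shared use, assuming 
each task creates its own objects and nothing is reused across threads. The 
names Extractor, PerTaskExtraction and extractAll are hypothetical and not part 
of PDFBox; the Extractor stands in for whatever load/strip/close sequence is 
used per file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical stand-in for a non-thread-safe text extractor (for example a
// wrapper around the PDFBox calls). It is NOT a PDFBox type.
interface Extractor {
    String extract(String source) throws Exception;
}

class PerTaskExtraction {
    // Runs one task per source on a fixed pool. Each task asks the factory
    // for a fresh Extractor, so no extractor instance is shared by threads.
    static List<String> extractAll(List<String> sources,
                                   Supplier<Extractor> factory,
                                   int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String source : sources) {
                Callable<String> task = () -> factory.get().extract(source);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                // get() rethrows a task failure wrapped in ExecutionException.
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Results come back in the order the sources were submitted. Whether this is 
enough depends on what else the library shares internally, which the sketch 
does not address.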

- you ran into an issue which was resolved in the current trunk [1]

I have not tried the current trunk; I just downloaded the latest binary 
release, i.e. v1.6.0.

- OutOfMemory

I never get an OutOfMemory error, as I have around 8 GB of RAM in my Mac and I 
set the maximum memory while parsing.

You probably won't see the exception, as it is swallowed.
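
A rough sketch of how failures that do reach a worker task could be made 
visible per file, assuming the files are processed through an 
ExecutorService. FailureReport and collectFailures are hypothetical names. 
This only reports what actually propagates out of a task; anything the 
library catches internally stays hidden.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class FailureReport {
    // Submits one task per source and returns what each failed task threw,
    // keyed by source, in submission order. Sources are assumed to be unique.
    static Map<String, Throwable> collectFailures(
            List<String> sources,
            Function<String, Callable<String>> taskFor,
            int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> futures = new LinkedHashMap<>();
            for (String source : sources) {
                futures.put(source, pool.submit(taskFor.apply(source)));
            }
            Map<String, Throwable> failures = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
                try {
                    e.getValue().get();
                } catch (ExecutionException ex) {
                    // getCause() is what the task threw, including Errors
                    // such as OutOfMemoryError.
                    failures.put(e.getKey(), ex.getCause());
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }
}
```

The sources that end up in the returned map can then be re-run one at a time 
in a single thread to check whether a particular document is at fault.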

I reread your code and you might change it to something like this:

..
PDDocument document = PDDocument.load(inStream);
PDFTextStripper stripper = new PDFTextStripper();
String str = stripper.getText(document);
...

You don't need your own PDFParser.

regards,
RB


________________________________
From: "Andreas Lehmkühler"<[email protected]>
To: Cool The Breezer<[email protected]>
Sent: Friday, March 16, 2012 1:42 AM
Subject: Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading 
corrupt stream


Hi,

Cool The Breezer<[email protected]> wrote on 15 March 2012 at 07:38:

Hello Group,
I recently downloaded PDFBox 1.6.0. I am using it to parse PDF files from URLs 
in a multi-threaded environment with at most 4 threads. It works fine for 
about 200 files and then displays the following exception:
org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
I am using PDFBox on Mac OS X Lion with the following code:

URL url = new URL( filePath );
URLConnection urlConn = url.openConnection();
InputStream inStream = urlConn.getInputStream();
PDFParser pdfParser = new PDFParser(inStream);
pdfParser.parse();
document = new PDDocument(pdfParser.getDocument());
PDFTextStripper stripper = new PDFTextStripper();
String str = stripper.getText(document);

inStream.close();
output.close();
document.close();


There may be a couple of different reasons for that. The version you are using 
swallows the original exception.

- one of your PDFs may be corrupt, try to find out if the exception occurs when 
processing the very same document
- you ran into an issue which was resolved in the current trunk [1]
- OutOfMemory


In addition to the above error, I am getting "ERROR 
org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined 
CMAP file for 'Adobe--UCS2'", but that does not stop the parser from extracting 
text, so I am ignoring it. Please suggest a workaround.

regards,
RB

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1232


BR
Andreas Lehmkühler
