>- one of your PDFs may be corrupt, try to find out if the exception occurs 
>when processing the very same document
I can parse that same PDF file on its own without any issue. In a multi-threaded 
environment, however, after parsing about 200 files I keep getting this exception, 
and from then on none of the files are parsed successfully. I then have to 
forcibly stop the parser.
>- you ran into an issue which was resolved in the current trunk [1] 
I have not tried the current trunk; I just downloaded the latest binary release, 
i.e. v1.6.0.
>- OutOfMemory
I never get an OutOfMemory error: my Mac has around 8 GB of RAM, and I set the 
maximum memory while parsing.

regards,
RB


________________________________
 From: "Andreas Lehmkühler" <[email protected]>
To: Cool The Breezer <[email protected]> 
Sent: Friday, March 16, 2012 1:42 AM
Subject: Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading 
corrupt stream
 

Hi,

Cool The Breezer <[email protected]> wrote on March 15, 2012 at 07:38: 

> Hello Group, 
> I recently downloaded PDFBox 1.6.0. I am using it to parse PDF files fetched 
> by URL in a multi-threaded environment with at most 4 threads. It works fine 
> for about 200 files and then logs the following exception: 
> org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream 
> I am using PDFBox on Mac OS X Lion with the following code: 
> 
> URL url = new URL( filePath ); 
> URLConnection urlConn = url.openConnection(); 
> InputStream inStream = urlConn.getInputStream(); 
> PDFParser pdfParser = new PDFParser(inStream); 
> pdfParser.parse(); 
> document = new PDDocument(pdfParser.getDocument()); 
> PDFTextStripper stripper = new PDFTextStripper(); 
> String str = stripper.getText(document); 
> 
> inStream.close();  
> output.close(); 
> document.close(); 
 
There may be a couple of different reasons for that. The version you are using 
swallows the original exception. 
 
- one of your PDFs may be corrupt, try to find out if the exception occurs when 
processing the very same document
- you ran into an issue which was resolved in the current trunk [1] 
- OutOfMemory
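
To check the first and third points, it can help to record which document each 
failure belongs to, along with the error that caused it. The following is only 
a minimal sketch using the standard java.util.concurrent classes. The names 
FailureTracker and Extractor are hypothetical and not part of PDFBox; the actual 
PDFBox parsing code would go inside the Extractor implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs one extraction per document and records failures. */
final class FailureTracker {

    /** Hypothetical hook; the real per-document parsing code goes here. */
    interface Extractor {
        String extract(String location) throws Exception;
    }

    /** Returns a map from each failing location to the error it raised. */
    static Map<String, Throwable> run(List<String> locations,
                                      Extractor extractor,
                                      int threads) {
        Map<String, Throwable> failures = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String location : locations) {
            pool.submit(() -> {
                try {
                    extractor.extract(location);
                } catch (Throwable t) {
                    // Throwable also covers errors such as OutOfMemoryError.
                    failures.put(location, t);
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the submitted work to finish (the timeout is arbitrary).
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failures;
    }
}
```

If the same location fails on every run, the file itself is suspect. If the set 
of failing locations changes from run to run, the problem is more likely related 
to load or to the threads interacting.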
 
> 
> In addition to the above error, I am getting 
> ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse 
> predefined CMAP file for 'Adobe--UCS2'. That does not stop the parser from 
> extracting text, so I am ignoring this error. Please suggest a workaround. 
> 
> regards, 
> RB 
BR
Andreas Lehmkühler
 
[1] https://issues.apache.org/jira/browse/PDFBOX-1232 
