Hi,
On 18.03.2012 09:05, Cool The Breezer wrote:
- one of your PDFs may be corrupt, try to find out if the exception occurs when
processing the very same document
I can parse the same PDF file without any issue, but in a multi-threaded
environment, after parsing about 200 files, I keep getting this exception and
none of the files parse successfully. I then had to forcefully stop the parser.
Hmmm, PDFBox isn't supposed to be threadsafe, so that could be the problem.
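One way to rule out thread-safety problems is to make sure that no parser, document, or stripper object is ever shared between threads, and that a failure is reported together with the file that caused it. The following is only a minimal sketch using the JDK alone: `PerTaskExtraction`, `extractAll`, and the `extractor` function are hypothetical names standing in for the per-file PDFBox calls, and they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PerTaskExtraction {

    /**
     * Runs the (hypothetical) extractor once per input on a fixed-size pool.
     * The extractor is expected to create every working object it needs
     * inside the call, so nothing is shared between threads.
     * Returns a map from each input to its text, or to "FAILED: <cause>".
     */
    public static Map<String, String> extractAll(List<String> inputs,
                                                 Function<String, String> extractor,
                                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String input : inputs) {
                // One independent task per file; no state is carried over.
                Callable<String> task = () -> extractor.apply(input);
                futures.add(pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < inputs.size(); i++) {
                String input = inputs.get(i);
                try {
                    results.put(input, futures.get(i).get());
                } catch (ExecutionException e) {
                    // Keep the cause next to the file name instead of losing it.
                    results.put(input, "FAILED: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.put(input, "FAILED: interrupted");
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Called with the list of URLs, an extractor that opens, reads, and closes one document per call, and a pool size of 4, the returned map shows which files fail, which helps to check whether it is always the very same document.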
- you ran into an issue which was resolved in the current trunk [1]
I have not tried the current trunk; I just downloaded the latest binary release,
i.e. v1.6.0.
- OutOfMemory
I never get an OutOfMemory error, as I have around 8 GB of RAM in my Mac and I
set the maximum RAM while parsing.
You probably won't see the exception as it is swallowed.
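Since the error may never surface, it can help to log the heap limits the JVM is actually running with, for example to confirm that a maximum-heap setting took effect. The heap limit is set by the JVM configuration, not by the physical RAM in the machine. This is a minimal sketch using only `java.lang.Runtime`; the class name `HeapReport` and its `describe` method are hypothetical.

```java
public class HeapReport {

    /** Formats the JVM's current heap figures in MiB. */
    public static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        long max = rt.maxMemory() / mib;      // upper limit the heap may grow to
        long total = rt.totalMemory() / mib;  // heap currently reserved by the JVM
        long used = (rt.totalMemory() - rt.freeMemory()) / mib;
        return "max=" + max + "MiB total=" + total + "MiB used=" + used + "MiB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging `HeapReport.describe()` every few dozen files would show whether the used heap climbs steadily towards the limit before the failures start.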
I reread your code and you might change it to something like this:
..
PDDocument document = PDDocument.load(inStream);
PDFTextStripper stripper = new PDFTextStripper();
String str = stripper.getText(document);
...
You don't need your own PDFParser.
regards,
RB
________________________________
From: "Andreas Lehmkühler" <[email protected]>
To: Cool The Breezer<[email protected]>
Sent: Friday, March 16, 2012 1:42 AM
Subject: Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading
corrupt stream
Hi,
Cool The Breezer<[email protected]> wrote on 15 March 2012 at 07:38:
Hello Group,
I recently downloaded PDFBox 1.6.0. I am using it to parse
PDF files from URLs in a multi-threaded environment, with at most 4 threads. It
works fine for about 200 files and then displays the following exception:
org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
I am using PDFBox on Mac OS X Lion. I am using the following code:
URL url = new URL( filePath );
URLConnection urlConn = url.openConnection();
InputStream inStream = urlConn.getInputStream();
PDFParser pdfParser = new PDFParser(inStream);
pdfParser.parse();
document = new PDDocument(pdfParser.getDocument());
PDFTextStripper stripper = new PDFTextStripper();
String str = stripper.getText(document);
inStream.close();
output.close();
document.close();
There may be a couple of different reasons for that. The version you are using
swallows the original exception.
- one of your PDFs may be corrupt, try to find out if the exception occurs when
processing the very same document
- you ran into an issue which was resolved in the current trunk [1]
- OutOfMemory
In addition to the above error, I am getting the error
ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined
CMAP file for 'Adobe--UCS2'
but it does not stop the parser from extracting text, so I am ignoring it.
Please suggest a workaround.
regards,
RB
BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-1232