Hi folks (apologies, hit send too soon)

We run PDFBox for PDF text extraction under the DSpace application.

We occasionally see extraction failures, and we’re investigating a couple of
them just now. I’m wondering which property of the PDF the parser is looking
at in each case, and whether there’s any way we can mitigate the problem.
It’s certainly not the title.


One is:

java.lang.RuntimeException: java.io.IOException: Not a number: +
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)


And here’s another:


java.lang.NumberFormatException: For input string: "dup"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.parseInt(Integer.java:527)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:5)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
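
Both failures above surface as unchecked exceptions (a RuntimeException
wrapping an IOException, and a NumberFormatException), so one possible stopgap
is to catch them per document so that a single bad PDF does not abort a whole
run. The following is only a rough sketch of that idea: SafeExtract,
TextExtractor and tryExtract are hypothetical names, not PDFBox or DSpace
APIs, and the extractor argument stands in for whatever code actually calls
PDFTextStripper.writeText.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of PDFBox or DSpace. */
final class SafeExtract {

    /** Stand-in for the real extraction call (assumed to wrap PDFTextStripper.writeText). */
    @FunctionalInterface
    interface TextExtractor {
        String extract() throws IOException;
    }

    private SafeExtract() {}

    /**
     * Runs the extractor and returns its text, or Optional.empty() if the
     * extractor throws an IOException or any RuntimeException.
     */
    static Optional<String> tryExtract(String label, TextExtractor extractor) {
        try {
            return Optional.ofNullable(extractor.extract());
        } catch (IOException | RuntimeException e) {
            // NumberFormatException is a RuntimeException, so both traces
            // in this message would end up here.
            System.err.println("Text extraction failed for " + label + ": " + e);
            return Optional.empty();
        }
    }
}
```

A caller would pass the real extraction step as a lambda, for example
SafeExtract.tryExtract("item.pdf", () -> runStripper(doc)), where runStripper
is again a hypothetical method. This only hides the symptom; it does not tell
us which part of the PDF triggers the parse error.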

Thanks
Scott
--
Scott Renton
Digital Development
Library and University Collections
Argyle House, Floor F
ext: 515219

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
