Hi folks (apologies, hit send too soon),

We run PDFBox for PDF text extraction under the DSpace application.
Occasionally we get the odd failure, and we’re investigating some errors just now. I’m wondering which property of the PDF in question PDFBox is looking at when it fails, and whether there’s any way we can guard against that. It’s certainly not the title.

One is:

java.lang.RuntimeException: java.io.IOException: Not a number: +
java.lang.RuntimeException: java.io.IOException: Not a number: +
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)

And here’s another:

java.lang.NumberFormatException: For input string: "dup"
java.lang.NumberFormatException: For input string: "dup"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.parseInt(Integer.java:527)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java: 5)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)

Thanks
Scott

--
Scott Renton
Digital Development
Library and University Collections
Argyle House, Floor F
ext: 515219

