Hi,

> On 01.06.2017 at 12:25, RENTON Scott <[email protected]> wrote:
>
> Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite
> old?) - that’s certainly the .jar that’s sitting in the DSpace lib directory.
> I’ve copied in George as he’s investigating this too; George, I take it we’re
> ok to send Maruan a link to the relevant records in the repository?
>

You should really upgrade, either to the latest 1.8 release or to the 2.0 release. The 1.8 API is more in line with 1.6, whereas 2.0 saw several API changes; development now mainly goes into 2.0. Both include many improvements in parsing malformed PDFs. In addition, with the tremendous help of the Tika colleagues, text extraction is now run against a much larger test corpus.

You can download the pdfbox-app…jar

http://www-us.apache.org/dist/pdfbox/2.0.6/pdfbox-app-2.0.6.jar
http://www-us.apache.org/dist/pdfbox/1.8.13/pdfbox-app-1.8.13.jar

and run the ExtractText command-line tool to check whether the issue you are facing still occurs with the newer versions.

1.6.0 was released in July 2011, so yes, quite old.

BR
Maruan

> Cheers
> Scott
> --
> Scott Renton
>
> Digital Development
> Library and University Collections
> Argyle House, Floor F
> ext: 515219
>
> On 01/06/2017 11:18, "Maruan Sahyoun" <[email protected]> wrote:
>
>> Hi Scott,
>>
>> which version of PDFBox are you using? Is it possible to share one of the
>> PDFs at a public location?
>>
>> BR
>> Maruan
>>
>>> On 01.06.2017 at 12:11, RENTON Scott <[email protected]> wrote:
>>>
>>> Hi folks (apologies - hit send too soon)
>>>
>>> We run PDFBox for PDF text extraction under the DSpace application.
>>>
>>> Occasionally we get the odd failure, and we’re investigating some errors
>>> just now. I’m just wondering what property of the PDF in question it’s
>>> looking at here, and if there’s any way we can mitigate against that. It’s
>>> certainly not the title.
>>>
>>> One is:
>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>>>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>>>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>>>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>>>
>>> And here’s another:
>>>
>>> java.lang.NumberFormatException: For input string: "dup"
>>> java.lang.NumberFormatException: For input string: "dup"
>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>     at java.lang.Integer.parseInt(Integer.java:492)
>>>     at java.lang.Integer.parseInt(Integer.java:527)
>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>>>     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>>>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:5)
>>>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>>>
>>> Thanks
>>> Scott
>>> --
>>> Scott Renton
>>> Digital Development
>>> Library and University Collections
>>> Argyle House, Floor F
>>> ext: 515219
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]

