Thanks Maruan. Yes, we’re currently only on DSpace 4.6 for George’s repository. Interestingly, we’re in the process of upgrading to 5, which is in test and should ship the newer PDFBox version, so I think we should try extracting the item there and see if that works. If not, I will get back in touch!
Cheers again
Scott
--
Scott Renton
Digital Development
Library and University Collections
Argyle House, Floor F
ext: 515219

On 01/06/2017 11:43, "Maruan Sahyoun" <[email protected]> wrote:

>btw - which version of DSpace is in use? AFAIK 5.1 already uses pdfbox 1.8.x
>and 2.0
>
>> On 01.06.2017 at 12:40, Maruan Sahyoun <[email protected]> wrote:
>>
>> Hi,
>>
>>> On 01.06.2017 at 12:25, RENTON Scott <[email protected]> wrote:
>>>
>>> Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite
>>> old?) - that’s certainly the .jar that’s sitting in the dspace lib
>>> directory. I’ve copied in George as he’s investigating this too; George, I
>>> take it we’re ok to send Maruan a link to the relevant records in the
>>> repository?
>>>
>>
>> You should really upgrade either to the latest 1.8 release or to the 2.0
>> release (the 1.8 API is more in line with 1.6, whereas 2.0 saw several
>> changes - development now mainly goes into 2.0). Both added many
>> improvements for parsing malformed PDFs. In addition - with the tremendous
>> help of the TIKA colleagues - text extraction is now run against a much
>> larger test corpus.
>>
>> You can download the pdfbox-app…jar
>>
>> http://www-us.apache.org/dist/pdfbox/2.0.6/pdfbox-app-2.0.6.jar
>> http://www-us.apache.org/dist/pdfbox/1.8.13/pdfbox-app-1.8.13.jar
>>
>> and run the ExtractText command line tool to verify whether the issue you
>> are facing still occurs with the newer versions.
>>
>> 1.6.0 was released in July 2011 - so yes, quite old.
>>
>> BR
>> Maruan
>>
>>> Cheers
>>> Scott
>>> --
>>> Scott Renton
>>>
>>> Digital Development
>>> Library and University Collections
>>> Argyle House, Floor F
>>> ext: 515219
>>>
>>> On 01/06/2017 11:18, "Maruan Sahyoun" <[email protected]> wrote:
>>>
>>>> Hi Scott,
>>>>
>>>> which version of PDFBox are you using? Is it possible to share one of the
>>>> PDFs at a public location?
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> On 01.06.2017 at 12:11, RENTON Scott <[email protected]> wrote:
>>>>>
>>>>> Hi folks (apologies - hit send too soon)
>>>>>
>>>>> We run pdfbox for pdf text extraction under the DSpace application.
>>>>>
>>>>> Occasionally we get the odd failure, and we’re investigating some errors
>>>>> just now. I’m just wondering what property of the PDF in question it’s
>>>>> looking at here, and if there’s any way we can mitigate that.
>>>>> It’s certainly not the title.
>>>>>
>>>>> One is:
>>>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>>>>>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>>>>>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>>>>>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>>>>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>>>>>
>>>>> And here’s another:
>>>>>
>>>>> java.lang.NumberFormatException: For input string: "dup"
>>>>> java.lang.NumberFormatException: For input string: "dup"
>>>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>     at java.lang.Integer.parseInt(Integer.java:492)
>>>>>     at java.lang.Integer.parseInt(Integer.java:527)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>>>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:
>>>>> 5)
>>>>>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>>>>>
>>>>> Thanks
>>>>> Scott
>>>>> --
>>>>> Scott Renton
>>>>> Digital Development
>>>>> Library and University Collections
>>>>> Argyle House, Floor F
>>>>> ext: 515219
>>>>>
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
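
On the question of what the parser is looking at: the second trace in the original report shows Integer.parseInt being handed the token "dup" from PDType1Font.getEncodingFromFont, which, judging by the method name alone, is reading the encoding of a Type 1 font rather than any document metadata. The sketch below is not PDFBox code; it is a plain-Java illustration, under that assumption, of how a strict integer parse produces exactly that message and how a tolerant parse could skip an unexpected token instead of aborting. The class and method names (TolerantIntParse, parseOrEmpty, strictFailureMessage) are hypothetical.

```java
import java.util.OptionalInt;

// Hypothetical illustration only: not taken from PDFBox or DSpace.
class TolerantIntParse {

    // Tolerant variant: returns empty instead of throwing when the token
    // is not a decimal integer, so a caller can skip the token and continue.
    static OptionalInt parseOrEmpty(String token) {
        try {
            return OptionalInt.of(Integer.parseInt(token));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    // Strict variant: returns the exception message produced by
    // Integer.parseInt for a bad token, or null if the token parses.
    static String strictFailureMessage(String token) {
        try {
            Integer.parseInt(token);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Same message as in the reported trace.
        System.out.println(strictFailureMessage("dup")); // For input string: "dup"
        System.out.println(parseOrEmpty("dup").isPresent()); // false
        System.out.println(parseOrEmpty("65").getAsInt()); // 65
    }
}
```

Whether the newer PDFBox releases handle such fonts this way is not stated in the thread; running ExtractText from the 1.8.13 or 2.0.6 app jar against the failing PDFs, as suggested, remains the way to find out.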

