Hi,

> Am 01.06.2017 um 12:25 schrieb RENTON Scott <[email protected]>:
> 
> Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite 
> old?); that’s certainly the .jar that’s sitting in the DSpace lib directory. 
> I’ve copied in George as he’s investigating this too. George, I take it we’re 
> OK to send Maruan a link to the relevant records in the repository?
> 

You should really upgrade, either to the latest 1.8 release or to the 2.0 
release. The 1.8 API is closer to 1.6, whereas 2.0 introduced several API 
changes; development now mainly goes into 2.0. Both release lines have seen 
many improvements in parsing malformed PDFs. In addition, with the tremendous 
help of the Tika colleagues, text extraction is now run against a much larger 
test corpus.

You can download the pdfbox-app…jar

http://www-us.apache.org/dist/pdfbox/2.0.6/pdfbox-app-2.0.6.jar
http://www-us.apache.org/dist/pdfbox/1.8.13/pdfbox-app-1.8.13.jar

and run the ExtractText command line tool to verify if the issue you are facing 
is still relevant with the newer versions.
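
If you have several failing PDFs, a small wrapper can run ExtractText on each 
of them in turn. What follows is only a rough sketch using the plain JDK: the 
class name ExtractTextCheck, the helper names and the ".txt" output naming are 
made up for illustration, and it assumes the downloaded pdfbox-app jar is 
invoked as "java -jar <jar> ExtractText <input.pdf> <output.txt>".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): runs the ExtractText command line
// tool from a pdfbox-app jar against each PDF given on the command line.
public class ExtractTextCheck {

    // Builds the command: java -jar <appJar> ExtractText <pdf> <txt>
    static List<String> buildCommand(String appJar, String pdf, String txt) {
        return new ArrayList<>(Arrays.asList("java", "-jar", appJar, "ExtractText", pdf, txt));
    }

    // Starts ExtractText as a child process and waits for it to finish.
    // The child's stdout/stderr are passed through so stack traces stay visible.
    static int run(String appJar, String pdf, String txt)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(appJar, pdf, txt))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExtractTextCheck <pdfbox-app.jar> <file.pdf>...");
            return;
        }
        String appJar = args[0];
        for (int i = 1; i < args.length; i++) {
            String pdf = args[i];
            // Assumed naming scheme: write the text next to the PDF.
            int exitCode = run(appJar, pdf, pdf + ".txt");
            System.out.println(pdf + " -> exit code " + exitCode);
        }
    }
}
```

Running it once with the 1.8.13 jar and once with the 2.0.6 jar should show 
which of the problem files still fail; a stack trace on the console for a given 
file would suggest the issue is still present in that version.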

1.6.0 was released in July 2011, so yes, quite old.

BR
Maruan


> Cheers
> Scott
> -- 
> Scott Renton
> 
> Digital Development
> Library and University Collections
> Argyle House, Floor F
> ext: 515219
> 
> On 01/06/2017 11:18, "Maruan Sahyoun" <[email protected]> wrote:
> 
>> Hi Scott,
>> 
>> which version of PDFBox are you using? Is it possible to share one of the 
>> PDFs at a public location?
>> 
>> BR
>> Maruan
>> 
>>> Am 01.06.2017 um 12:11 schrieb RENTON Scott <[email protected]>:
>>> 
>>> 
>>> Hi folks (apologies, hit send too soon)
>>> 
>>> We run PDFBox for PDF text extraction under the DSpace application.
>>> 
>>> Occasionally we get the odd failure, and we’re investigating some errors 
>>> just now. I’m wondering which property of the PDF in question it’s 
>>> looking at here, and whether there’s any way we can mitigate that. It’s 
>>> certainly not the title.
>>> 
>>> 
>>> One is:
>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>>> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>>> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>>> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>>> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>>> 
>>> 
>>> And here’s another:
>>> 
>>> java.lang.NumberFormatException: For input string: "dup"
>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>> at java.lang.Integer.parseInt(Integer.java:492)
>>> at java.lang.Integer.parseInt(Integer.java:527)
>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>>> at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>>> at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>>> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>>> at 
>>> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:
>>> 5)
>>> at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>>> 
>>> Thanks
>>> Scott
>>> -- 
>>> Scott Renton
>>> Digital Development
>>> Library and University Collections
>>> Argyle House, Floor F
>>> ext: 515219
>>> 
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]