Re: PDFBOX - TIKA - PDF parsing error

Daniel MendesDaSilva Fri, 30 Jun 2017 04:38:03 -0700

The issue is in the class COSFloat: there is a rather suspicious try/catch 
with a comment that references other open, similar bugs.
There is a regexp there that does not match "-.", so that token falls through 
to the number parsing and fails.
Tika is used to parse the PDF and transform it into HTML; it calls PDFBox via 
org.apache.pdfbox.text.PDFTextStripper.processPage
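
To illustrate why "-." fails, here is a minimal standalone sketch. It is not 
the PDFBox source: the class FloatTokenDemo and the method parseStrict are 
hypothetical names, and the only assumption taken from the stack trace is 
that the token is handed to java.math.BigDecimal and that the resulting 
NumberFormatException is wrapped in an IOException.

```java
import java.io.IOException;
import java.math.BigDecimal;

public class FloatTokenDemo {

    // Hypothetical helper, NOT PDFBox code: parse a numeric token with
    // BigDecimal and wrap a parse failure in an IOException, mirroring the
    // exception chain seen in the stack trace.
    public static float parseStrict(String token) throws IOException {
        try {
            return new BigDecimal(token).floatValue();
        } catch (NumberFormatException e) {
            throw new IOException(
                "Error expected floating point number actual='" + token + "'", e);
        }
    }

    public static void main(String[] args) {
        // "-.5" and "12." contain digits and parse; "-." has no digits at all.
        for (String token : new String[] {"-.5", "12.", "-."}) {
            try {
                System.out.println(token + " -> " + parseStrict(token));
            } catch (IOException e) {
                System.out.println(token + " -> " + e.getMessage()
                    + " (cause: " + e.getCause().getClass().getName() + ")");
            }
        }
    }
}
```

A sign and a decimal point with no digits is not a number for BigDecimal, so 
any parser that forwards "-." unchanged will end up in the catch branch.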



Here's the stack trace:

| org.apache.tika.exception.TikaException: Unable to extract all PDF content
|          at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
|          at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
|          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
|          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
|          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
|          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
|          at java.lang.Thread.run(Thread.java:748)
| Caused by: java.io.IOException: Error expected floating point number actual='-.'
|          at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:81)
|          at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:115)
|          at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:263)
|          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
|          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
|          at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
|          at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
|          at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
|          at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
|          at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
|          at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
|          at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
|          ... 113 common frames omitted
| Caused by: java.lang.NumberFormatException: null
|          at java.math.BigDecimal.<init>(BigDecimal.java:498)
|          at java.math.BigDecimal.<init>(BigDecimal.java:383)
|          at java.math.BigDecimal.<init>(BigDecimal.java:806)
|          at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
|          ... 124 common frames omitted

On 2017-06-29 21:22 (+0200), Tilman Hausherr <[email protected]> wrote:
> On 29.06.2017 at 12:33, Daniel MendesDaSilva wrote:
> >
> > Daniel Mendes da Silva
> > Senior Analyst Programmer
> >
> > From: Daniel MendesDaSilva
> > Sent: 29 June 2017 12:25
> > To: '[email protected]'; '[email protected]'
> > Subject: PDFBOX - TIKA - PDF parsing error
> > Importance: High
> >
> > Hi,
> >
> > We're using PdfBox through Tika and we get an exception when parsing a 5 MB 
> > PDF file - I'm not able to attach it to this mail.
> >
> > Any ideas why we have this error?
> > Why is PDFBOX trying to parse "-." as a number?
> >
> >
> > Caused by: java.io.IOException: Error expected floating point number 
> > actual='-.'
> > Caused by: java.lang.NumberFormatException: null
>
> Most likely your PDF is incorrect, even if Adobe Reader can display it. 
> At some place, PDFBox expects a floating point number and gets "-." 
> instead. Please upload the PDF to a sharehoster.
>
> Tilman