The issue is in the class COSFloat: there is a rather suspicious try/catch block whose comment refers to other open, similar bugs. There is also a regexp that does not match "-.". Tika is used to parse the PDF and transform it into HTML, and the failure occurs inside PDFBox, under org.apache.pdfbox.text.PDFTextStripper.processPage.
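To illustrate why the token fails, here is a minimal sketch (not PDFBox code) showing that java.math.BigDecimal accepts "-.5" but rejects "-.", which is consistent with the NumberFormatException at the bottom of the trace. The class name MinusDotRepro and the helper tryParse are hypothetical and exist only for this example.

```java
import java.math.BigDecimal;

class MinusDotRepro {
    // Hypothetical helper, not part of PDFBox or Tika:
    // returns the parsed value, or null if BigDecimal rejects the token.
    static BigDecimal tryParse(String token) {
        try {
            return new BigDecimal(token);
        } catch (NumberFormatException e) {
            // "-." has a sign and a decimal point but no digits at all
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("-.5 -> " + tryParse("-.5"));
        System.out.println("-.  -> " + tryParse("-."));
    }
}
```

So a "-." token in a content stream cannot be turned into a number by BigDecimal alone; how PDFBox should treat it (reject it, or read it as 0 the way some viewers apparently do) is a separate question.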
Here's the stack:

org.apache.tika.exception.TikaException: Unable to extract all PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error expected floating point number actual='-.'
    at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:81)
    at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:115)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:263)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
    ... 113 common frames omitted
Caused by: java.lang.NumberFormatException: null
    at java.math.BigDecimal.<init>(BigDecimal.java:498)
    at java.math.BigDecimal.<init>(BigDecimal.java:383)
    at java.math.BigDecimal.<init>(BigDecimal.java:806)
    at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
    ... 124 common frames omitted

On 2017-06-29 21:22 (+0200), Tilman Hausherr <[email protected]> wrote:
> On 29.06.2017 at 12:33, Daniel MendesDaSilva wrote:
> >
> > Daniel Mendes da Silva
> > Senior Analyst Programmer
> >
> > From: Daniel MendesDaSilva
> > Sent: 29 June 2017 12:25
> > To: '[email protected]'; '[email protected]'
> > Subject: PDFBOX - TIKA - PDF parsing error
> > Importance: High
> >
> > Hi,
> >
> > We're using PdfBox through Tika and we get an exception when parsing a 5 MB
> > PDF file - I'm not able to attach to this mail.
> >
> > Any ideas why we have this error?
> > Why PDFBOX is trying to parse "-." as a number ?
> >
> >
> > Caused by: java.io.IOException: Error expected floating point number
> > actual='-.'
> > Caused by: java.lang.NumberFormatException: null
>
> Most likely your PDF is incorrect, even if Adobe Reader can display it.
> At some place, PDFBox expects a floating point number and gets "-."
> instead. Please upload the PDF to a sharehoster.
>
> Tilman

