We are crawling a website with Norconex, which uses PDFBox (through Tika) behind the scenes. The stack trace of the error I'm encountering is below; the URL of the document is on its first line. I see the same error on other documents as well, so it is not just this one. I am using PDFBox 2.0.8 in this setup.
Other PDFs are parsed successfully, so this is a sporadic issue that probably comes down to how the PDF was generated (which is unknown to me), but I would appreciate some advice on this one.

WARN - Could not import https://www.skiffmed.com/media/cms/Iowa_Ortho__complete_history_form_0BBBB4188490A.pdf
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:313)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
    ... 14 more
Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid bit length repeat
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:167)
    at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:155)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:485)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 21 more
Caused by: java.util.zip.DataFormatException: invalid bit length repeat
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:108)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
    ... 34 more
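To illustrate what the innermost cause in the trace means, here is a minimal sketch that uses only java.util.zip and no PDFBox code. As far as I can tell, the DataFormatException comes straight from the zlib inflater when the compressed (Flate) page content stream is damaged, and the trace shows it being wrapped in an IOException. The class name FlateSketch, the lenient flag, and the way the stream is corrupted (flipping a bit in the trailing checksum) are my own assumptions for illustration; this corruption will not necessarily produce the same "invalid bit length repeat" message as the real file, and it is not a copy of what FlateFilter does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of PDFBox, Tika, or Norconex.
class FlateSketch {

    /** Compresses data in zlib format, the format used by PDF Flate streams. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /**
     * Decompresses zlib data. In strict mode a corrupt stream fails with an
     * IOException wrapping the DataFormatException, as in the trace above.
     * In lenient mode the bytes decoded before the error are returned.
     */
    static byte[] inflate(byte[] compressed, boolean lenient) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[16]; // small on purpose: output is drained in chunks
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated stream: nothing more to consume
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            if (!lenient) {
                throw new IOException(e);
            }
            // lenient: fall through and keep what was decoded so far
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[1000];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) ('a' + i % 7);
        }
        byte[] damaged = deflate(original);
        damaged[damaged.length - 1] ^= 0x01; // corrupt the trailing checksum

        byte[] partial = inflate(damaged, true);
        System.out.println("lenient: recovered " + partial.length
                + " of " + original.length + " bytes");
        try {
            inflate(damaged, false);
        } catch (IOException e) {
            System.out.println("strict: " + e);
        }
    }
}
```

Running main prints how many bytes the lenient mode salvaged and the wrapped exception from the strict mode. If the failing PDFs are damaged in this way, the question becomes whether the parser can be made to skip or partially read the bad stream instead of rejecting the whole document.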

