Hi! I have a very large PDF file (more than 1 GB) and I'm trying to extract text from it, but I'm getting this error:

Caused by: java.lang.IllegalArgumentException: capacity < 0: (-2115587440 < 0)
    at java.base/java.nio.Buffer.createCapacityException(Buffer.java:290) ~[na:na]
    at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:390) ~[na:na]
    at org.apache.pdfbox.io.RandomAccessReadBuffer.<init>(RandomAccessReadBuffer.java:70) ~[pdfbox-io-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.io.RandomAccessReadWriteBuffer.<init>(RandomAccessReadWriteBuffer.java:40) ~[pdfbox-io-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.filter.Filter.decode(Filter.java:250) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.cos.COSStream.createView(COSStream.java:196) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.pdmodel.PDPage.getContentsForRandomAccess(PDPage.java:177) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:59) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:525) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:506) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:362) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:288) ~[pdfbox-3.0.2.jar!/:3.0.2]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:235) ~[pdfbox-3.0.2.jar!/:3.0.2]

Is there a way to make this work, or are files this large simply not supported?

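The negative capacity in the message looks like a 32-bit integer overflow. This is only a guess: I have not checked how PDFBox computes the buffer size. The sketch below uses plain Java and no PDFBox calls. It shows that -2115587440 is what a size of 2179379856 bytes (roughly 2 GiB) becomes when it is narrowed to an int. The class and method names (CapacityOverflow, narrow, asUnsigned) are made up for illustration.

```java
public class CapacityOverflow {
    // Narrowing a long to an int keeps only the low 32 bits, so any size
    // above Integer.MAX_VALUE (2147483647) can come out negative.
    public static int narrow(long size) {
        return (int) size;
    }

    // Reinterpret an int as an unsigned 32-bit value, to recover the size
    // that a wrapped (negative) int may originally have stood for.
    public static long asUnsigned(int value) {
        return Integer.toUnsignedLong(value);
    }

    public static void main(String[] args) {
        int reported = -2115587440;             // value from the exception message
        long candidate = asUnsigned(reported);  // assumed original size in bytes

        System.out.println("candidate size: " + candidate);
        System.out.println("narrowed back:  " + narrow(candidate));
        System.out.println("exceeds int:    " + (candidate > Integer.MAX_VALUE));
    }
}
```

If this reading is right, the stream being decoded is larger than what a single ByteBuffer.allocate(int) call can hold, regardless of how much heap is available.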
Here is the problematic file:
https://mega.nz/file/NscTmJSL#Bp4TL4UjqUqgMykNO_f7j33y3n0Zwy12K7fNr45GYF8

I opened it with Acrobat Reader, and it doesn't look corrupted. It takes a long time to load, but eventually I am able to select text in it, etc.

Best regards,
Patrycja Zaremba

On Thu, 20 Jun 2024 at 23:42, Patrycja Zaremba <patrycja.zare...@schibsted.com> wrote:

> Hi,
> I got this error when converting a PDF to an image.
>
>> ERROR org.apache.pdfbox.pdmodel.font.PDType1Font -- Can't read the embedded Type1 font CJGKGJ+HelveticaAB-Halvfet
>> java.io.IOException: Found Token[kind=NAME, text=y→� ~$;VᄅoリᅱuᅩAb →→ !ᅰᄐe�ᅩwロᄂニDᆳ"i ハ.] but expected ND
>>     at org.apache.fontbox.type1.Type1Parser.readDef(Type1Parser.java:839)
>>     at org.apache.fontbox.type1.Type1Parser.readCharStrings(Type1Parser.java:804)
>>     at org.apache.fontbox.type1.Type1Parser.parseBinary(Type1Parser.java:647)
>>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:71)
>>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:243)
>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:140)
>>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
>>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:893)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:531)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:506)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
>>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:286)
>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:330)
>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:247)
>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:219)
>
> PDFBox version 3.0.2.
>
> Is this an issue with the PDF itself or with the library?
> Here is an example of a problematic PDF:
> https://mega.nz/file/osthCK6Q#UVoaV75ExP9ro_x2hNvbP3xEmK-tkZja3eiwG7S8Ilc
>
> Best regards,
>
> Patrycja Zaremba