Hi,

I've tried to run PDFDebugger from the latest PDFBox. What is the normal, expected result? In my case it just hangs after printing:

Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")

Thanks

On Mon, Dec 2, 2019 at 11:13 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Send it to me, tilman at snafu dot de.
>
> (The readLangSysTable problem should be solved in 2.0.17, so make sure
> you are using that one)
>
> Oops, I see this is the tika list, so maybe that is a lower version. Please
> retry with a "freshly downloaded" PDFDebugger from the pdfbox website.
>
> Tilman
>
> On 02.12.2019 at 16:42, Slava G wrote:
>
> I have a PDF that reproduces a similar problem:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
>     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>
> To whom can I send the PDF for investigation? (It's from a customer, so I
> can't send it publicly.)
>
> Thanks
>
> On Sun, Nov 3, 2019 at 12:10 PM Slava G <slav...@gmail.com> wrote:
>
>> Well, it's not easy to provide those documents: they are customer
>> content, so I need to get customer approval first.
>> I'll try, and will let you know.
>> Thanks
>>
>> On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>> Hello,
>>>
>>> I'd be interested in the OOM exception. The one below aborts the
>>> parsing. Can you open a PDFBox issue and attach your PDF? We could just
>>> skip the table here instead of failing.
>>>
>>> Re the OOM we'd also need a PDF.
>>>
>>> Skipping parsing of embedded ttf will possibly have a negative impact on
>>> text extraction.
>>>
>>> Tilman
>>>
>>>
>>> On 03.11.2019 at 10:38, Slava G wrote:
>>> > Hi,
>>> > When parsing some PDF files we see different errors related to PDF
>>> > parsing. One is an OutOfMemoryError during PDF parsing, and another is:
>>> >
>>> > WARN - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
>>> > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
>>> >     at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
>>> >     at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
>>> >     at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
>>> >     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>>> >     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>>> >     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>>> >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>>> >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>>> >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>>> >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>> >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>> >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>>> >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>>> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>>> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>>> >
>>> > How can I skip parsing of embedded TTF inside PDF?
>>> >
>>> > Thanks
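
A note on Tilman's suggestion in the thread that the parser could skip the broken table instead of failing: a minimal Java sketch of that idea might look like the following. It is not FontBox code. The class `SkipTableSketch`, the methods `readTable` and `parseTables`, the set `OPTIONAL_TAGS`, and the 6-byte minimum applied to every table are all assumptions made up for illustration (the 6-byte figure is borrowed from the error message quoted in the thread).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from PDFBox or FontBox.
class SkipTableSketch {
    // Assumption for this sketch: only these table tags may be dropped
    // without aborting the whole font parse.
    static final Set<String> OPTIONAL_TAGS = Collections.singleton("kern");

    // Stand-in for a real table reader: rejects data shorter than 6 bytes.
    static void readTable(String tag, byte[] data) throws IOException {
        if (data.length < 6) {
            throw new IOException(tag + " table too short, got "
                    + data.length + " bytes, expect 6 or more");
        }
    }

    // Reads every table in order. A failure in an optional table is logged
    // and skipped; a failure in any other table is rethrown to the caller.
    static List<String> parseTables(Map<String, byte[]> tables) throws IOException {
        List<String> parsed = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : tables.entrySet()) {
            try {
                readTable(entry.getKey(), entry.getValue());
                parsed.add(entry.getKey());
            } catch (IOException ex) {
                if (!OPTIONAL_TAGS.contains(entry.getKey())) {
                    throw ex;
                }
                System.err.println("Skipping optional table: " + ex.getMessage());
            }
        }
        return parsed;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> tables = new LinkedHashMap<>();
        tables.put("head", new byte[54]);
        tables.put("kern", new byte[0]);   // empty, like the 0-byte sub-table above
        tables.put("glyf", new byte[128]);
        System.out.println(parseTables(tables)); // prints [head, glyf]
    }
}
```

In this sketch the font still loads without its kerning data, which fits Tilman's caveat that skipping parts of an embedded font may affect text extraction.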