Re: How to skip parsing embedded TTF inside PDF

Slava G Sun, 03 Nov 2019 02:11:21 -0800

Well, it's not easy to provide those documents: they're customer content,
so I need to get the customer's approval first. I'll try and will let you know.
Thanks


On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr <[email protected]>
wrote:

> Hello,
>
> I'd be interested in the OOM exception. The one below aborts the
> parsing. Can you open a PDFBox issue and attach your PDF? We could just
> skip the table here instead of failing.
>
> Re the OOM we'd also need a PDF.
>
> Skipping parsing of embedded ttf will possibly have a negative impact on
> text extraction.
>
> Tilman
>
>
> Am 03.11.2019 um 10:38 schrieb Slava G:
> > Hi,
> > When parsing some PDF files we see different errors. One is an
> > OutOfMemory exception during PDF parsing, and another is:
> >
> > WARN      - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
> > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
> > at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
> > at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
> > at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
> > at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
> > at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> > at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> > at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> > at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
> > at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> > at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> > at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
> > at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> > at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> > at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> > at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
> > at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> > at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> >
> > How can I skip parsing of an embedded TTF inside a PDF?
> >
> > Thanks
>
>
>
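
The suggestion quoted above, to skip the broken kerning table instead of failing, can be illustrated with a minimal sketch. This is not PDFBox or FontBox code: the class LenientKerningReader and its method are hypothetical names, and the layout assumed is the version 0 'kern' table (uint16 version, uint16 nTables, then for each sub-table uint16 version, uint16 length, uint16 coverage). When a sub-table declares a length below the 6-byte header size, the sketch stops reading kerning data and returns what it has, rather than throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of FontBox.
class LenientKerningReader {

    // Size of a sub-table header: version, length, coverage (3 x uint16).
    static final int MIN_SUBTABLE_BYTES = 6;

    /**
     * Returns the declared lengths of all well-formed kerning sub-tables.
     * A sub-table that is too short ends the scan instead of raising an error,
     * so the caller can keep using the rest of the font without kerning.
     */
    static List<Integer> readUsableSubtableLengths(byte[] kernTable) {
        List<Integer> lengths = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(kernTable))) {
            in.readUnsignedShort();                // table version (ignored here)
            int nTables = in.readUnsignedShort();
            for (int i = 0; i < nTables; i++) {
                in.readUnsignedShort();            // sub-table version
                int length = in.readUnsignedShort();
                in.readUnsignedShort();            // coverage flags
                if (length < MIN_SUBTABLE_BYTES) {
                    // The declared length cannot be trusted, so the position of
                    // the next sub-table is unknown: give up on kerning only.
                    break;
                }
                lengths.add(length);
                in.skipBytes(length - MIN_SUBTABLE_BYTES); // skip the sub-table body
            }
        } catch (IOException e) {
            // Truncated table: keep whatever was read so far.
        }
        return lengths;
    }
}
```

With a table holding one valid header-only sub-table followed by one whose length field is 0, the method returns a list containing only the first length, and the caller would simply go without kerning data for the rest. Whether dropping kerning this way is acceptable for text extraction in a real parser is a separate question that this sketch does not answer.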
