Re: How to skip parsing embedded TTF inside PDF

Slava G Mon, 02 Dec 2019 22:17:15 -0800

Hi,
I tried running PDFDebugger from the latest PDFBox. What is the normal
expected result? In my case it just hung after printing:
Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")

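For reference, the second INFO line above can be followed literally. The sketch below only sets the system property named in the log; the class name KcmsProperty and the helper enableKcms are made up for illustration. It assumes the property is set early, before any rendering code runs, and that the JDK in use actually ships the KCMS provider, which is not the case for every JDK. As far as I can tell this is a performance hint for colour handling and not necessarily related to the hang.

```java
public class KcmsProperty {
    // Key and value copied from the PDFRenderer log message above.
    static final String KEY = "sun.java2d.cmm";
    static final String VALUE = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    /** Sets the property; returns the previous value, or null if it was unset. */
    static String enableKcms() {
        return System.setProperty(KEY, VALUE);
    }

    public static void main(String[] args) {
        // Call this first, before any PDFBox rendering is started.
        enableKcms();
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```

The command-line form from the first INFO line (-Dsun.java2d.cmm=...) should have the same effect without a code change.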

Thanks

On Mon, Dec 2, 2019 at 11:13 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Send it to me,  tilman at snafu dot  de.
>
> (The readLangSysTable problem should be solved in 2.0.17, so make sure
> you are using that one)
>
> Oops, I see this is the Tika list, so maybe that is a lower version. Please
> retry with a freshly downloaded PDFDebugger from the PDFBox website.
>
> Tilman
>
> Am 02.12.2019 um 16:42 schrieb Slava G:
>
> I have a PDF that reproduces a similar problem:
>
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
> at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>
>
> To whom can I send the PDF for investigation? (It's from a customer, so I
> can't send it publicly.)
>
>
> Thanks
>
> On Sun, Nov 3, 2019 at 12:10 PM Slava G <slav...@gmail.com> wrote:
>
>> Well, it's not easy to provide those documents, as they're customer
>> content and I need the customer's approval for that. I'll try, and will
>> let you know.
>> Thanks
>>
>> On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>> Hello,
>>>
>>> I'd be interested in the OOM exception. The one below aborts the
>>> parsing. Can you open a PDFBox issue and attach your PDF? We could just
>>> skip the table here instead of failing.
>>>
>>> Re the OOM we'd also need a PDF.
>>>
>>> Skipping parsing of embedded ttf will possibly have a negative impact on
>>> text extraction.
>>>
>>> Tilman
>>>
>>>
>>> Am 03.11.2019 um 10:38 schrieb Slava G:
>>> > Hi,
>>> > When parsing some PDF files we see different errors. One is an
>>> > OutOfMemoryError during PDF parsing, and another is:
>>> >
>>> > WARN      - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
>>> > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
>>> > at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
>>> > at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
>>> > at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
>>> > at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>>> > at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>>> > at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>>> > at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>>> > at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>>> > at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>>> > at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>>> > at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>>> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>>> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>>> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>>> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>>> > at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>> > at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>> > at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>>> > at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>>> > at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>> > at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>>> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>>> >
>>> > How can I skip parsing of embedded TTF fonts inside a PDF?
>>> >
>>> > Thanks
>>>
>>>
>>>
>
