> On 30 Oct 2016, at 09:22, Maryam Z <[email protected]> wrote:
>
> Hi,
>
> Thank you for the quick reply.
>
> I did in fact try the "Acrobat test", and just copying and pasting produces
> the same results (jumbled) as the PDFBox extraction.
>
> The Font Map shows glyphs being mapped to the wrong Unicode values. But
> since we know the correct mapping between glyphs and Unicode values, can't
> we overwrite the default mapping to use a custom mapping?
>
> Please find below the link to a PDF with the issue.
> https://goo.gl/vrXzBv
> <http://wikisend.com/download/304776/font_test.pdf>
I agree with Olaf that the real issue here is complex scripts.

Looking at the final bullet point on the first page, I see the word:

ලොව

which is made up of four glyphs, but only three Unicode characters:

ල ො ව

because one of the three characters, the vowel sign ො, is a ligature rendered
as two glyphs, one on either side of the consonant. The problem is that when a
text extractor tries to map the glyphs back to characters, it cannot succeed:
there are four glyphs but only three characters. And because different
ligatures share some of the same glyphs...

ෙ and ේ and ෝ all start with the same glyph

...there is no way to map a glyph uniquely to a Unicode character. So the
concept of a ToUnicode map is simply not powerful enough to express the
mappings needed by complex scripts. While there may be problems with the
ToUnicode map in this particular document, even a perfect ToUnicode map could
not help you.

The only way to embed complex text in a PDF robustly is to use /ActualText via
marked content. Interestingly enough, your PDF does contain marked-content
operators, but the Properties resource dictionaries are missing, so we know
nothing about the marked content.

Good luck,

— John

> Thank you very much for your assistance, once again.
>
>
> On Sun, Oct 30, 2016 at 8:52 PM, Andreas Lehmkuehler <[email protected]>
> wrote:
>
>> Hi,
>>
>> Am 30.10.2016 um 07:46 schrieb Maryam Z:
>>
>>> Hi,
>>>
>>> I am trying to extract Sinhala and Tamil text from PDFs, and am facing a
>>> problem extracting text correctly when the PDF uses the Unicode fonts
>>> "Iskoola Pota" (Sinhala) or "Latha" (Tamil).
>>>
>>> While the extraction works as expected when the encoding is WinAnsi, if
>>> the encoding is "Identity-H" some letters tend to be jumbled (valid
>>> Sinhala or Tamil characters, but the wrong ones), and the jumbled letters
>>> differ from PDF to PDF. This is because the ToUnicode table in such PDFs
>>> is incorrect, mapping glyphs to the wrong Unicode values.
>>>
>>> I came across the solution to the Identity-H problem for CJK fonts using
>>> CMap files, but CMap files for these two fonts are not available.
>>>
>>> I would be grateful if you could let me know whether there is any way to
>>> override the ToUnicode map and use a custom map during extraction, one
>>> that correctly maps glyphs to values, or whether there is any other
>>> effective solution to this problem.
>>>
>> Did you perform the "Acrobat test"? See [1].
>>
>> What version of PDFBox are you using?
>>
>> Can you share a sample PDF with us (provide a link to a public download
>> site/sharehoster)?
>>
>>> Thank you!
>>>
>> BR
>> Andreas
>>
>> [1] http://pdfbox.apache.org/2.0/faq.html#notext
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
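[Editor's note] John's four-glyphs-versus-three-characters argument can be made concrete with a small, self-contained Java sketch. This is not PDFBox code: the glyph IDs and the table entries below are invented for illustration, and the map simply stands in for a /ToUnicode CMap, which is likewise single-valued (one glyph code maps to one string).

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeDemo {
    // The word ලොව as Unicode: LA + VOWEL SIGN O + VA (three characters).
    static final String WORD = "\u0DBD\u0DDC\u0DC0";

    // The glyph run a renderer produces for the same word (IDs invented):
    // the vowel sign ො is split into two glyphs placed on either side of
    // the consonant, so three characters become four glyphs.
    static final int[] GLYPHS = {10 /* left half of ො */, 20 /* ල */,
                                 11 /* right half of ො */, 30 /* ව */};

    // A stand-in for a ToUnicode CMap: single-valued, glyph -> string.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(10, "\u0DD9"); // ෙ — a typical (wrong) guess
        TO_UNICODE.put(20, "\u0DBD"); // ල
        TO_UNICODE.put(11, "\u0DCF"); // ා — another wrong guess
        TO_UNICODE.put(30, "\u0DC0"); // ව
    }

    // Per-glyph extraction, the only thing a ToUnicode map can express.
    static String extract() {
        StringBuilder sb = new StringBuilder();
        for (int g : GLYPHS) sb.append(TO_UNICODE.get(g));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four glyphs can never collapse to three characters through a
        // per-glyph map, so the output is "jumbled" no matter what the
        // table contains. Worse, glyph 10 also begins the ligatures for
        // ේ and ෝ, so no single value for it is right in every word.
        System.out.println(extract().equals(WORD)); // prints "false"
    }
}
```

Swapping in any other values for glyphs 10 and 11 does not help: the extracted string always has four characters where the original had three, which is why a custom override of the map cannot fully fix this class of document.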


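[Editor's note] The /ActualText approach John recommends can also be modelled in a few lines of plain Java. This is a toy model, not the PDFBox API: the Span type and glyph IDs are invented. The point is only that a marked-content span carries the producer's original character string for a whole run of glyphs, so an extractor that prefers /ActualText sidesteps the per-glyph mapping problem entirely.

```java
import java.util.List;
import java.util.Map;

public class ActualTextDemo {
    // Toy model of a marked-content span: the glyphs it covers plus the
    // /ActualText string the producer recorded for them (may be null).
    record Span(int[] glyphs, String actualText) {}

    // Extraction: when a span carries /ActualText, emit that string and
    // ignore the per-glyph ToUnicode mapping; otherwise fall back to it.
    static String extract(List<Span> spans, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (Span s : spans) {
            if (s.actualText() != null) {
                out.append(s.actualText());
            } else {
                for (int g : s.glyphs())
                    out.append(toUnicode.getOrDefault(g, "\uFFFD"));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The four glyphs of ලොව, tagged with the real three-character
        // text — the producer knows the original string, so it can
        // record it even though no per-glyph map could express it.
        List<Span> spans =
            List.of(new Span(new int[]{10, 20, 11, 30}, "\u0DBD\u0DDC\u0DC0"));
        System.out.println(extract(spans, Map.of())); // prints ලොව
    }
}
```

Of course, this only works when the producing application writes the /ActualText entries in the first place; as John notes, the PDF in question has marked-content operators but is missing the Properties dictionaries that would carry them.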