Hi,

Thank you for the quick reply.

I did in fact try the "Acrobat test": simply copying and pasting produces the same jumbled results as the PDFBox extraction. The Font Map shows glyphs being mapped to the wrong Unicode values. But since we know the correct mapping between glyphs and Unicode values, can't we override the default mapping and use a custom mapping instead?

Please find below the link to a PDF with the issue.

https://goo.gl/vrXzBv
<http://wikisend.com/download/304776/font_test.pdf>

Thank you very much for your assistance, once again.

On Sun, Oct 30, 2016 at 8:52 PM, Andreas Lehmkuehler wrote:

> Hi,
>
> On 30.10.2016 at 07:46, Maryam Z wrote:
>
>> Hi,
>>
>> I am trying to extract Sinhala and Tamil text from PDFs, and am facing a
>> problem extracting text correctly when the PDF uses the Unicode fonts
>> "Iskoola Pota" (Sinhala) or "Latha" (Tamil).
>>
>> While the extraction works as expected when the encoding is WinAnsi, if
>> the encoding is "Identity-H" some letters tend to be jumbled (valid
>> Sinhala or Tamil characters, but wrong) and the jumbled letters differ
>> from PDF to PDF. This is because the toUnicode table for such PDFs is
>> incorrect, mapping glyphs to the wrong Unicode values.
>>
>> I came across the solution for the Identity-H problem for CJK fonts using
>> CMap files, but the CMap files for these two fonts are not available.
>>
>> I would be grateful if you could let me know if there is any way to
>> override the toUnicode map and use a custom map in extraction, which
>> correctly maps glyphs to values, or if there is any other effective
>> solution for this problem.
>>
> Did you perform the "acrobat test", see [1]?
>
> What version of PDFBox are you using?
>
> Can you share a sample pdf with us (provide a link to a public download
> site/sharehoster)?
>
>> Thank you!
>>
> BR
> Andreas
>
> [1] http://pdfbox.apache.org/2.0/faq.html#notext
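
To make the custom-mapping idea from the question concrete, here is a minimal sketch of the lookup it would need: a table from character code to the correct Unicode text, falling back to whatever the PDF's own toUnicode map produced when a code is not in the table. The class name GlyphRemapper and the sample code 0x0123 are hypothetical placeholders, not values taken from the sample PDF. With Identity-H the character code is the CID, so the table would most likely have to be built per font, and possibly per document if the fonts are subsetted. The PDFBox wiring is not shown and is untested here; in PDFBox 2.x it could presumably live in a PDFTextStripper subclass that overrides writeString(String, List<TextPosition>) and reads TextPosition.getCharacterCodes() and getUnicode().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: replaces wrongly mapped text using a hand-built code-to-Unicode table. */
final class GlyphRemapper {
    // Key: character code from the content stream; value: the correct Unicode text.
    private final Map<Integer, String> codeToUnicode = new HashMap<>();

    /** Registers one correction and returns this remapper so calls can be chained. */
    GlyphRemapper put(int code, String unicode) {
        codeToUnicode.put(code, unicode);
        return this;
    }

    /** Returns the corrected text for a code, or the fallback if the code is not in the table. */
    String remap(int code, String fallback) {
        String fixed = codeToUnicode.get(code);
        return fixed != null ? fixed : fallback;
    }

    /** Remaps a run of codes; fallbacks[i] is the text the PDF's own map gave for codes[i]. */
    String remapAll(int[] codes, String[] fallbacks) {
        if (codes.length != fallbacks.length) {
            throw new IllegalArgumentException("codes and fallbacks must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            sb.append(remap(codes[i], fallbacks[i]));
        }
        return sb.toString();
    }
}
```

Keeping the fallback means only the glyphs known to be wrong need an entry; everything else passes through unchanged.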

