When I extract the text in Acrobat, about 80% of it comes out correct, AFAICT. Given the scripts used, there may be a bigger issue. To understand it (i.e. how complex scripts are encoded in PDF), it is well worth watching the video recording of this excellent presentation, given at PDF Days Europe 2016 in Berlin, Germany (cf. https://www.pdfa.org/slide-decks-and-video-recordings-of-the-pdf-days-europe-2016/ ):

PDF and OpenType technology: The ideal match or an uneasy compromise?
Benoît Lagae, iText and Alexey Subach, Dual Lab
https://youtu.be/qZnXZppH2KI

Presentation slides:
https://www.dropbox.com/s/j5ocs7olx7jioua/1100%20Presentation_OpenType.pdf?dl=1

I fear that just “fixing” ToUnicode tables (if at all) will not be a 100% solution…

Olaf

> On 30 Oct 2016, at 17:22, Maryam Z <[email protected]> wrote:
>
> Hi,
>
> Thank you for the quick reply.
>
> I did in fact try the "Acrobat test" and just copying and pasting produces
> the same results (jumbled) as the PDFBox extraction.
>
> The Font Map shows glyphs being mapped to the wrong Unicode values. But
> since we know the correct mapping between glyphs and Unicode values can't
> we overwrite the default mapping to use a custom mapping?
>
> Please find below the link to a PDF with the issue.
> https://goo.gl/vrXzBv
> <http://wikisend.com/download/304776/font_test.pdf>
>
> Thank you very much for your assistance, once again.
>
>
> On Sun, Oct 30, 2016 at 8:52 PM, Andreas Lehmkuehler <[email protected]>
> wrote:
>
>> Hi,
>>
>> On 30.10.2016 at 07:46, Maryam Z wrote:
>>
>>> Hi,
>>>
>>> I am trying to extract Sinhala and Tamil text from PDFs, and am facing a
>>> problem extracting text correctly when the PDF uses Unicode Fonts "Iskoola
>>> Pota" (Sinhala) or "Latha" (Tamil).
>>>
>>> While the extraction works as expected when the encoding is WinAnsi, if
>>> the encoding is "Identity-H" some letters tend to be jumbled (valid
>>> Sinhala or Tamil characters, but wrong) and the jumbled letters differ
>>> from PDF to PDF. This is because the toUnicode table for such PDFs are
>>> incorrect, mapping glyphs to the wrong Unicode values.
>>>
>>> I came across the solution for the Identity-h problem for CJK fonts using
>>> CMap files, but the CMap files for these two fonts are not available.
>>>
>>> I would be grateful if you could let me know if there is any way to
>>> overwrite the toUnicode map and use a custom map in extraction, which
>>> correctly maps glyphs to values, or if there is any other effective
>>> solution for this problem.
>>>
>> Did you perform the "acrobat test", see [1] ?
>>
>> What version of PDFBox are you using?
>>
>> Can you share a sample pdf with us (provide a link to a public download
>> site/sharehoster)?
>>
>>
>>> Thank you!
>>>
>>>
>> BR
>> Andreas
>>
>> [1] http://pdfbox.apache.org/2.0/faq.html#notext
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
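
A footnote on the question quoted in this thread (overriding the default glyph-to-Unicode mapping with a custom one): below is a minimal sketch in plain Java of what such an override table could look like. It is only an illustration under stated assumptions. The class `ToUnicodeOverride`, its methods and the character code `0x0123` are all made up for this example and are not part of PDFBox. In PDFBox 2.0, one possible place to apply such a table would be a `PDFTextStripper` subclass that overrides `writeString` and looks at `TextPosition.getCharacterCodes()`; that wiring is an assumption and is not shown or tested here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of PDFBox): replaces wrong ToUnicode results
// with hand-verified values, keyed by character code.
final class ToUnicodeOverride {
    private final Map<Integer, String> overrides = new HashMap<>();

    // Register the correct text for one character code. The value is a String
    // because one glyph may stand for several code points (e.g. a ligature).
    void put(int code, String unicode) {
        overrides.put(code, unicode);
    }

    // Use the override if there is one, otherwise keep what the PDF reported.
    String resolve(int code, String fromPdf) {
        String fixed = overrides.get(code);
        return fixed != null ? fixed : fromPdf;
    }

    // Remap a run of glyphs; codes[i] and fromPdf[i] describe the same glyph.
    String remap(int[] codes, String[] fromPdf) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            String s = resolve(codes[i], fromPdf[i]);
            if (s != null) {
                sb.append(s);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToUnicodeOverride map = new ToUnicodeOverride();
        // Made-up code 0x0123, corrected to a Sinhala letter.
        map.put(0x0123, "\u0D9A");
        String text = map.remap(new int[] {0x0123, 0x0124},
                                new String[] {"\u0D9B", "\u0DBA"});
        // Print code points rather than glyphs to stay independent of the console encoding.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

A table like this can only repair one-to-one (or one-to-many) mistakes. Where the shaping of a complex script has reordered or merged glyphs, a per-code lookup cannot restore the original character sequence, which is why fixing the ToUnicode values alone may not be a complete solution.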

