Hi, I am trying to extract Sinhala and Tamil text from PDFs, and the text is not extracted correctly when the PDF uses the Unicode fonts "Iskoola Pota" (Sinhala) or "Latha" (Tamil).
Extraction works as expected when the encoding is WinAnsi, but when the encoding is "Identity-H" some letters come out jumbled (they are valid Sinhala or Tamil characters, but the wrong ones), and which letters are jumbled differs from PDF to PDF. This is because the ToUnicode tables in such PDFs are incorrect and map glyphs to the wrong Unicode values. I came across the solution to the Identity-H problem for CJK fonts, which uses CMap files, but CMap files for these two fonts are not available. I would be grateful if you could let me know whether there is any way to override the ToUnicode map and use a custom map during extraction that maps glyphs to the correct values, or whether there is any other effective solution to this problem. Thank you!
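To illustrate what I mean by a custom map, here is a minimal sketch in plain Python. It does not touch the ToUnicode map inside any PDF library; it only post-processes text that has already been extracted, replacing each wrongly mapped character with the intended one. The function names and the sample pairs are hypothetical, and the sketch assumes that each wrong character corresponds to exactly one correct character within a given PDF, which may not hold for every file.

```python
def build_correction_table(pairs):
    """Build a translation table from (wrong_char, correct_char) pairs.

    The pairs would have to be worked out per PDF, since the jumbled
    letters differ from file to file.
    """
    table = {}
    for wrong, correct in pairs:
        if len(wrong) != 1:
            raise ValueError("each wrong entry must be a single character")
        table[ord(wrong)] = correct
    return table


def fix_extracted_text(text, table):
    """Replace every wrongly mapped character in one pass.

    str.translate applies all replacements simultaneously, so two
    characters that were swapped with each other are both corrected.
    """
    return text.translate(table)


# Hypothetical example pairs, NOT taken from a real PDF:
# U+0D9B was extracted where U+0D9A was intended (Sinhala),
# U+0B9A was extracted where U+0B95 was intended (Tamil).
HYPOTHETICAL_PAIRS = [
    ("\u0D9B", "\u0D9A"),
    ("\u0B9A", "\u0B95"),
]

if __name__ == "__main__":
    table = build_correction_table(HYPOTHETICAL_PAIRS)
    extracted = "\u0D9B\u0DB8 \u0B9A\u0BAE"  # stand-in for extractor output
    print(fix_extracted_text(extracted, table))
```

Doing the same thing inside the extractor, at the glyph level instead of on the output text, is what I am hoping is possible.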

