Well, it turned out that a wrong CMap in the (TrueType) font was the
origin of this problem. When copy-pasting text from Adobe Reader, the
same artifacts were shown.
In the faulty font, glyph #218 (sacute) was mapped to both U+0153 (oelig)
and U+015B (sacute). Fortunately, the PostScript names in the font's post
table were present, so I could look up the correct Unicode value (via
Adobe's glyph list, sacute = U+015B) and apply a correction mapping to
PDFTextStripper's text output.
Font CMap: Unicode -> Index
Font post table: Index -> PSName
Adobe Glyph List: PSName -> Unicode
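For anyone hitting the same issue, here is a minimal sketch of how the three lookups above can be chained into a correction table. It assumes the three tables have already been read into plain Java maps (how you read them depends on your font tooling and is not shown); the class name CMapCorrection and both method names are hypothetical and not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: derives and applies a
// code point correction table from three already-loaded lookup maps.
final class CMapCorrection {

    /**
     * cmap:       Unicode code point -> glyph index (as found in the font)
     * glyphNames: glyph index -> PostScript glyph name
     * glyphList:  PostScript glyph name -> Unicode code point (Adobe Glyph List)
     *
     * Returns wrong code point -> expected code point for every cmap entry
     * whose glyph name resolves to a different code point.
     */
    static Map<Integer, Integer> buildCorrections(Map<Integer, Integer> cmap,
                                                  Map<Integer, String> glyphNames,
                                                  Map<String, Integer> glyphList) {
        Map<Integer, Integer> corrections = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> entry : cmap.entrySet()) {
            String name = glyphNames.get(entry.getValue());
            Integer expected = (name == null) ? null : glyphList.get(name);
            // Skip glyphs without a usable name and entries that already agree.
            if (expected != null && !expected.equals(entry.getKey())) {
                corrections.put(entry.getKey(), expected);
            }
        }
        return corrections;
    }

    /** Replaces every code point in text that has an entry in corrections. */
    static String apply(String text, Map<Integer, Integer> corrections) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(corrections.getOrDefault(cp, cp)));
        return out.toString();
    }
}
```

The result of buildCorrections can then be passed to apply together with the extracted text. Note that this replaces the affected code points everywhere in the text, so it is only safe when the whole text comes from the faulty font, or when the wrong code point does not legitimately occur elsewhere.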
Wulf
On 13.07.2011 16:33, Wulf Berschin wrote:
Hi,
when extracting a bunch of PDF documents in several languages, I wondered
why some special characters in some documents were wrong in the
extracted text files.
As it turns out, these incorrectly decoded PDFs have missing or flawed
ToUnicode CMaps. The fonts are TrueType and always embedded.
Does anybody know
- under what circumstances PDFs with no or incorrect CMaps are created?
- how I could work around this problem?
Since I have the TTFs: could I preload them? Otherwise: could I correct
the PDFs by replacing the wrong CMap or adding a correct one?
Thank you for your help.
Wulf