Hi Zeev,

> On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:
>
> Hello everyone,
>
> I've been using pdfbox 2.0 for a couple of weeks and came across an issue
> with a some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
> I needed to convert the symbols to their unicode equivalents, so I cooked up
> a small class to do that. No problems there.

It’s not clear exactly what you’re trying to do. Are you talking about extracting text from a PDF? I’m going to assume that you are.

> My issue is - some of the symbols coming in are already being converted and
> some are not. I do see that there is a list of glyphs that is being loaded to
> do just that (glyphlist.txt) and there is an additional list (additional.txt)
> for more glyphs. What I don't understand is how a glyph can be mapped without
> specifying a font name, for example in WPIconicSymbolsA dec 33 is an outline
> of a heart, in WPTypographicSymbols dec 33 is a large filled dot.

PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid onto it, so even though the font may be a TTF, there’s another layer of encoding. In some cases the original font’s encoding is stripped, so this is the only encoding; in other cases the PostScript encoding is empty and the TTF’s built-in encoding takes over.

Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which is a string, such as “Euro”. An encoding is a map of 8-bit codes to names; for example, WinAnsiEncoding is the Type 1 version of the familiar Windows-1252 encoding. So we’d have 128 => “Euro” in that case.

Later on, when Unicode was created, Adobe provided glyphlist.txt to map from the standard glyph names to Unicode code points, e.g. “Euro” => U+20AC. Combined with a Type 1 encoding, this lets us read a code in a PDF file and convert it to Unicode, e.g. 128 => “Euro” => U+20AC. This is a global mapping, so we don’t need one per font.

Some fonts use non-standard names for glyphs, usually because the glyph is unusual and no standard exists. PDF provides numerous mechanisms for such glyphs to be mapped to Unicode, and one of these is to look up the name in the standard glyph list. PDFBox ships with an additional, non-standard glyph list which covers some commonly encountered glyphs such as those found in TeX.
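
To make the code => name => Unicode chain concrete, here is a rough sketch in plain Java. It does not use PDFBox's own classes, and the class and method names are mine. The three standard entries (A, Euro, exclam) are real glyphlist.txt entries, but the names "heartoutline" and "bulletlarge", and the encodings shown for your two WP fonts, are made up for illustration since I don't know what names those fonts actually use.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: plain Java, not PDFBox's own classes.
class GlyphLookupSketch {

    // Parses glyph-list text: one "name;XXXX" entry per line, with hex code
    // points (space-separated if there are several); '#' starts a comment.
    static Map<String, String> parseGlyphList(String text) {
        Map<String, String> nameToUnicode = new HashMap<>();
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            nameToUnicode.put(parts[0].trim(), unicode.toString());
        }
        return nameToUnicode;
    }

    // Builds a code -> glyph name table from alternating code/name arguments.
    static Map<Integer, String> encodingOf(Object... codeNamePairs) {
        Map<Integer, String> encoding = new HashMap<>();
        for (int i = 0; i < codeNamePairs.length; i += 2) {
            encoding.put((Integer) codeNamePairs[i], (String) codeNamePairs[i + 1]);
        }
        return encoding;
    }

    // Three real entries from the standard glyph list.
    static final Map<String, String> STANDARD = parseGlyphList(
            "# excerpt of the standard list\nA;0041\nEuro;20AC\nexclam;0021\n");

    // HYPOTHETICAL names, standing in for entries one might add to additional.txt.
    static final Map<String, String> ADDITIONAL = parseGlyphList(
            "heartoutline;2661\nbulletlarge;25CF\n");

    // Excerpt of WinAnsiEncoding, plus two HYPOTHETICAL symbol-font encodings.
    static final Map<Integer, String> WIN_ANSI = encodingOf(33, "exclam", 65, "A", 128, "Euro");
    static final Map<Integer, String> ICONIC_A = encodingOf(33, "heartoutline");
    static final Map<Integer, String> TYPOGRAPHIC = encodingOf(33, "bulletlarge");

    // code => name via the font's encoding, then name => Unicode via the
    // global lists, trying the standard list before the additional one.
    static String toUnicode(int code, Map<Integer, String> encoding) {
        String name = encoding.get(code);
        if (name == null) {
            return null;
        }
        String unicode = STANDARD.get(name);
        return unicode != null ? unicode : ADDITIONAL.get(name);
    }

    static String describe(String unicode) {
        return unicode == null ? "unmapped" : String.format("U+%04X", unicode.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(describe(toUnicode(128, WIN_ANSI)));    // U+20AC
        System.out.println(describe(toUnicode(33, WIN_ANSI)));     // U+0021
        System.out.println(describe(toUnicode(33, ICONIC_A)));     // U+2661
        System.out.println(describe(toUnicode(33, TYPOGRAPHIC)));  // U+25CF
    }
}
```

Code 33 gives three different results here only because each font's encoding maps it to a different glyph name. The name => Unicode step is shared by all fonts, which is why no font name appears in the glyph lists.
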
The additional list is a bit of a hack, but such fonts typically don’t use any of the other Unicode mechanisms provided by PDF, so it is a last resort for mapping such glyphs to Unicode.

> So to be specific, my questions are :
>
> Is there any way to give pdf box a map *per font*?

A glyph’s name should uniquely identify that glyph, so this shouldn’t be necessary. Just add the missing names to additional.txt.

> What is the philosophy of glyph conversion how are different fonts converted
> to different unicode characters?

Hopefully I’ve covered that above. The overall philosophy is to avoid hard-coding where possible and infer Unicode from the PDF wherever possible.

> Please, let me know if I am looking at the whole thing incorrectly. Perhaps
> there is an easier way…

If you upload the PDF to a public URL then I can take a look at it and see exactly what the issue is.

> Thank you,
> Zeev