Hello,

I knew something was fishy about my theory ;) Thank you for taking the time to explain.

I didn't realize each font's glyph had a unique name. The symbolic fonts I'm working with are coming in with nondescript names, like "c85" for a glyph with code 85. I simply assumed it was just the code converted to a string, not a "name"; my bad. Some fonts are coming in with glyph names like ".notdef", so I'm getting warnings such as "No Unicode mapping for .notdef (10) in font Times-Roman".

I guess for "c85" I'll have to make sure this "name" is unique across the different fonts I'm working with and then add it to the non-standard glyph list. Or live with the warning and do some post-processing.
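
For illustration, the post-processing option could look roughly like the sketch below. It assumes the caller still has the font name and the original character code available for glyphs that were not converted. The class name, the method name, and the Unicode targets chosen for code 33 (U+2661 for the heart outline, U+25CF for the filled dot) are all assumptions made for the example, not part of PDFBox.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not a PDFBox class: a per-font fallback table
// consulted only for codes that were left unconverted.
class PerFontFallback {
    // font name -> (character code -> Unicode string)
    static final Map<String, Map<Integer, String>> TABLES = new HashMap<>();

    static {
        Map<Integer, String> iconic = new HashMap<>();
        iconic.put(33, "\u2661"); // assumed target for the heart outline
        TABLES.put("WPIconicSymbolsA", iconic);

        Map<Integer, String> typographic = new HashMap<>();
        typographic.put(33, "\u25CF"); // assumed target for the large filled dot
        TABLES.put("WPTypographicSymbols", typographic);
    }

    // Returns the replacement if this font/code pair is known, otherwise
    // an empty Optional so the caller can tell nothing was converted.
    static Optional<String> convert(String fontName, int code) {
        Map<Integer, String> table = TABLES.get(fontName);
        if (table == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(table.get(code));
    }
}
```

Keying on the font name first sidesteps the question of whether a name like "c85" is unique across fonts, at the cost of maintaining one table per font.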

What I think would be helpful in cases like this is to know whether a glyph was converted or is still the original code point. I will play with it for a bit and post a snippet if I come up with anything useful.

Thank you, I can see the logic now,
Zeev

On 04/29/2015 01:32 AM, John Hewson wrote:
Hi Zeev,

On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:

Hello everyone,

I've been using PDFBox 2.0 for a couple of weeks and came across an issue with 
some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
I needed to convert the symbols to their Unicode equivalents, so I cooked up a 
small class to do that. No problems there.
It’s not clear exactly what you’re trying to do. Are you talking about 
extracting text from a PDF? I’m going to assume that you are.

My issue is - some of the symbols coming in are already being converted and 
some are not. I do see that there is a list of glyphs that is being loaded to 
do just that (glyphlist.txt) and there is an additional list (additional.txt) 
for more glyphs. What I don't understand is how a glyph can be mapped without 
specifying a font name, for example in WPIconicSymbolsA dec 33 is an outline of 
a heart, in WPTypographicSymbols dec 33 is a large filled dot.
PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid onto 
it, so even though the font may be a TTF, there’s another layer of encoding. In 
some cases the original font’s encoding is stripped, so this is the only 
encoding, in other cases the PostScript encoding is empty and the TTF’s 
built-in encoding takes over.

Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which is a 
string, such as “Euro”. An encoding is a map of 8-bit codes to names, for example 
WinAnsiEncoding is the Type 1 version of the familiar Windows-1252 encoding. So 
we’d have 128 => “Euro”, in that case.

Later on, when Unicode was created, Adobe provided the glyphlist.txt to map from the 
standard glyph names to Unicode code points, e.g. “Euro” => U+20AC. Combined with a 
Type 1 encoding, this lets us read a code in a PDF file and convert it to Unicode, e.g. 
128 => “Euro” => U+20AC. This is a global mapping, so we don’t need one per font.
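
As a rough illustration of the two-step lookup described above, here is a minimal sketch using plain maps. It is not PDFBox code: the class and method names are made up for the example, and the maps hold only a couple of entries.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: code -> glyph name -> Unicode, with hypothetical names.
class TwoStepLookup {
    // Step 1: a tiny slice of an encoding (8-bit code -> glyph name).
    static final Map<Integer, String> ENCODING = new HashMap<>();
    // Step 2: a tiny slice of a glyph list (glyph name -> Unicode string).
    static final Map<String, String> GLYPH_LIST = new HashMap<>();

    static {
        ENCODING.put(128, "Euro"); // the WinAnsiEncoding example from the text
        ENCODING.put(65, "A");
        GLYPH_LIST.put("Euro", "\u20AC");
        GLYPH_LIST.put("A", "A");
    }

    // Returns the Unicode string for a code, or null if either step has no entry.
    static String toUnicode(int code) {
        String name = ENCODING.get(code);
        return name == null ? null : GLYPH_LIST.get(name);
    }

    public static void main(String[] args) {
        System.out.println(toUnicode(128));
    }
}
```

Because the second map is keyed by glyph name rather than by font, the same table serves every font that uses standard names.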

Some fonts use non-standard names for glyphs, usually because the glyph is 
unusual and no standard exists. PDF provides numerous mechanisms for such 
glyphs to be mapped to Unicode and one of these is to look up the name in the 
standard glyph list. PDFBox ships with an additional, non-standard glyph list 
which covers some commonly encountered glyphs such as those found in TeX. This 
is a bit of a hack, but such fonts typically don’t use any of the other Unicode 
mechanisms provided by PDF, so this is a last resort for mapping such glyphs to 
Unicode.
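
To make the fallback concrete, the sketch below parses entries in the "name;hex code points" form used by Adobe's glyphlist.txt and consults a second, non-standard list only when the standard one has no entry. It assumes the additional list uses the same line format. The class and method names are hypothetical, and any "c85" entry passed in is an invented example, not something PDFBox ships.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox's own glyph list implementation.
class GlyphListSketch {
    // Parses lines such as "Euro;20AC" or "name;05D3 05B2".
    // Blank lines and lines starting with '#' are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> map = new HashMap<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue; // ignore malformed lines in this sketch
            }
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            map.put(parts[0].trim(), unicode.toString());
        }
        return map;
    }

    // Standard names win; the additional list is only a last resort.
    static String lookup(String name, Map<String, String> standard,
                         Map<String, String> additional) {
        String unicode = standard.get(name);
        return unicode != null ? unicode : additional.get(name);
    }
}
```

A name found in neither list, such as ".notdef", yields null here, which corresponds to the situation in which a "No Unicode mapping" warning would be reported.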

So to be specific, my questions are :

  Is there any way to give PDFBox a map *per font*?
A glyph’s name should uniquely identify that glyph, so this shouldn’t be 
necessary. Just add the missing names to additional.txt.

  What is the philosophy of glyph conversion? How are different fonts converted 
to different Unicode characters?
Hopefully I’ve covered that above. The overall philosophy is to avoid 
hard-coding where possible and infer Unicode from the PDF wherever possible.

Please, let me know if I am looking at the whole thing incorrectly. Perhaps 
there is an easier way…
If you upload the PDF to a public URL then I can take a look at it and see 
exactly what the issue is.

Thank you,
Zeev


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



