Hi Zeev,

> On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:
> 
> Hello everyone,
> 
> I've been using pdfbox 2.0 for a couple of weeks and came across an issue 
> with some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
> I needed to convert the symbols to their unicode equivalents, so I cooked up 
> a small class to do that. No problems there.

It’s not clear exactly what you’re trying to do. Are you talking about 
extracting text from a PDF? I’m going to assume that you are.

> My issue is - some of the symbols coming in are already being converted and 
> some are not. I do see that there is a list of glyphs that is being loaded to 
> do just that (glyphlist.txt) and there is an additional list (additional.txt) 
> for more glyphs. What I don't understand is how a glyph can be mapped without 
> specifying a font name, for example in WPIconicSymbolsA dec 33 is an outline 
> of a heart, in WPTypographicSymbols dec 33 is a large filled dot.

PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid onto 
it, so even though the font may be a TTF, there’s another layer of encoding. In 
some cases the original font’s encoding is stripped, so this is the only 
encoding; in other cases the PostScript encoding is empty and the TTF’s 
built-in encoding takes over.

Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which is 
a string, such as “Euro”. An encoding is a map of 8-bit codes to names. For 
example, WinAnsiEncoding is the Type 1 version of the familiar Windows-1252 
encoding, so in that case we’d have 128 => “Euro”.

Later on, when Unicode was created, Adobe provided the glyphlist.txt to map 
from the standard glyph names to Unicode code points, e.g. “Euro” => U+20AC. 
Combined with a Type 1 encoding, this lets us read a code in a PDF file and 
convert it to Unicode, e.g. 128 => “Euro” => U+20AC. This is a global mapping, 
so we don’t need one per font.
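
To make the two-step lookup concrete, here is a rough sketch in plain Java. It 
is my own illustration, not PDFBox’s actual classes: the class name and the two 
tiny tables are made up, and they hold only a couple of entries standing in for 
a full encoding and for glyphlist.txt.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical class, not part of PDFBox.
class GlyphLookupSketch {
    // Encoding: 8-bit code -> glyph name (a small fragment of WinAnsiEncoding).
    static final Map<Integer, String> ENCODING = new HashMap<>();
    // Glyph list: glyph name -> Unicode string, shared by every font.
    static final Map<String, String> GLYPH_LIST = new HashMap<>();

    static {
        ENCODING.put(128, "Euro");
        ENCODING.put(65, "A");
        GLYPH_LIST.put("Euro", "\u20AC");
        GLYPH_LIST.put("A", "A");
    }

    // Returns the Unicode string for a code, or null if either step has no entry.
    static String toUnicode(int code) {
        String name = ENCODING.get(code);
        return name == null ? null : GLYPH_LIST.get(name);
    }

    public static void main(String[] args) {
        String euro = toUnicode(128);
        // Print the code point in hex so the result does not depend on the console encoding.
        System.out.println(String.format("U+%04X", euro.codePointAt(0)));
    }
}
```

Only the first table differs from font to font; the name-to-Unicode table is 
the same for all of them.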

Some fonts use non-standard names for glyphs, usually because the glyph is 
unusual and no standard exists. PDF provides numerous mechanisms for such 
glyphs to be mapped to Unicode and one of these is to look up the name in the 
standard glyph list. PDFBox ships with an additional, non-standard glyph list 
which covers some commonly encountered glyphs such as those found in TeX. This 
is a bit of a hack, but such fonts typically don’t use any of the other Unicode 
mechanisms provided by PDF, so this is a last resort for mapping such glyphs to 
Unicode.
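
The last-resort lookup can be pictured roughly like this. Again this is only a 
sketch under my own assumptions: the class and method are hypothetical, and 
“myarrow” is an invented non-standard glyph name, not one taken from any real 
font or from PDFBox’s list.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical class, not part of PDFBox.
class FallbackLookupSketch {
    // Try the standard glyph list first, then the additional list.
    // Returns null when neither list knows the name.
    static String toUnicode(String name,
                            Map<String, String> standard,
                            Map<String, String> additional) {
        String unicode = standard.get(name);
        return unicode != null ? unicode : additional.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> standard = new HashMap<>();
        standard.put("Euro", "\u20AC");

        Map<String, String> additional = new HashMap<>();
        additional.put("myarrow", "\u2192"); // hypothetical non-standard name

        String arrow = toUnicode("myarrow", standard, additional);
        System.out.println(String.format("U+%04X", arrow.codePointAt(0)));
        System.out.println(toUnicode("unknownglyph", standard, additional));
    }
}
```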

> So to be specific, my questions are :
> 
>  Is there any way to give pdf box a map *per font*?

A glyph’s name should uniquely identify that glyph, so this shouldn’t be 
necessary. Just add the missing names to additional.txt.
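
For reference, I’m assuming here that additional.txt uses the same line format 
as Adobe’s glyphlist.txt: a glyph name, a semicolon, then one or more 
hexadecimal code points separated by spaces, with “#” starting a comment line. 
The sketch below parses text in that format. The parser class is my own 
illustration, and “heartoutline” and “bigdot” are hypothetical names; you would 
need to check what your fonts actually call those glyphs.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical class, not part of PDFBox.
class GlyphListParserSketch {
    // Parses lines of the form "name;XXXX" or "name;XXXX YYYY" (hex code points).
    // Blank lines, '#' comment lines and malformed lines are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> map = new HashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue;
            }
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            map.put(parts[0].trim(), unicode.toString());
        }
        return map;
    }

    public static void main(String[] args) {
        String extra = "# hypothetical glyph names\n"
                     + "heartoutline;2661\n"
                     + "bigdot;25CF\n";
        Map<String, String> entries = parse(extra);
        System.out.println(entries.size() + " entries parsed");
    }
}
```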

>  What is the philosophy of glyph conversion how are different fonts converted 
> to different unicode characters?

Hopefully I’ve covered that above. The overall philosophy is to avoid 
hard-coding where possible and infer Unicode from the PDF wherever possible.

> Please, let me know if I am looking at the whole thing incorrectly. Perhaps 
> there is an easier way…

If you upload the PDF to a public URL then I can take a look at it and see 
exactly what the issue is.

> Thank you,
> Zeev
> 


