Hello,
I knew something was fishy about my theory ;) Thank you for taking the
time to explain.
I didn't realize each font's glyph had a unique name - the symbolic
fonts I'm working with are coming in with nondescript names, like, "c85"
for a glyph with code 85. I simply assumed it was just the code
converted to string, not a "name", my bad. Some fonts are coming in with
glyph names like ".notdef". So I'm getting a "No Unicode mapping for
.notdef (10) in font Times-Roman" warning etc.
I guess, for "c85" I'll have to make sure this "name" is unique across
different fonts I'm working with and then add it to the non-standard
glyph list. Or, eat up the warning and do some post processing.
What, I think, would be helpful in cases like this is to know if a glyph
was converted or it is the original code point. I will play with it for
a bit and post a snippet if I come up with anything useful.
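A rough sketch of the post-processing idea, in plain Java and independent of PDFBox's own classes: key the lookup on font name plus glyph name so that "c33" can mean different things in different fonts, and report whether a mapping was actually applied. The class name, the "c33" glyph name and the chosen code points (U+2661 for the heart outline, U+25CF for the large dot) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

final class PerFontPostProcessSketch {
    /** Outcome of a lookup: the text plus whether a mapping was applied. */
    static final class Result {
        final String text;
        final boolean converted;

        Result(String text, boolean converted) {
            this.text = text;
            this.converted = converted;
        }
    }

    // Hypothetical table keyed by "fontName/glyphName".
    private final Map<String, String> table = new HashMap<>();

    void add(String fontName, String glyphName, String unicode) {
        table.put(key(fontName, glyphName), unicode);
    }

    Result resolve(String fontName, String glyphName, int code) {
        String mapped = table.get(key(fontName, glyphName));
        if (mapped != null) {
            return new Result(mapped, true);
        }
        // No mapping known: hand back the raw code as a character and flag it as unconverted.
        return new Result(new String(Character.toChars(code)), false);
    }

    private static String key(String fontName, String glyphName) {
        return fontName + "/" + glyphName;
    }

    public static void main(String[] args) {
        PerFontPostProcessSketch sketch = new PerFontPostProcessSketch();
        sketch.add("WPIconicSymbolsA", "c33", "\u2661");     // assumed: heart outline
        sketch.add("WPTypographicSymbols", "c33", "\u25CF"); // assumed: large filled dot
        Result r = sketch.resolve("WPIconicSymbolsA", "c33", 33);
        System.out.println(r.text + " converted=" + r.converted);
    }
}
```

Qualifying the key by font avoids having to rename glyphs so they are globally unique, at the cost of keeping this table outside the glyph list.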
Thank you, I can see the logic now,
Zeev
On 04/29/2015 01:32 AM, John Hewson wrote:
Hi Zeev,
On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:
Hello everyone,
I've been using pdfbox 2.0 for a couple of weeks and came across an issue with
some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
I needed to convert the symbols to their unicode equivalents, so I cooked up a
small class to do that. No problems there.
It’s not clear exactly what you’re trying to do. Are you talking about
extracting text from a PDF? I’m going to assume that you are.
My issue is - some of the symbols coming in are already being converted and
some are not. I do see that there is a list of glyphs that is being loaded to
do just that (glyphlist.txt) and there is an additional list (additional.txt)
for more glyphs. What I don't understand is how a glyph can be mapped without
specifying a font name, for example in WPIconicSymbolsA dec 33 is an outline of
a heart, in WPTypographicSymbols dec 33 is a large filled dot.
PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid onto
it, so even though the font may be a TTF, there’s another layer of encoding. In
some cases the original font’s encoding is stripped, so this is the only
encoding, in other cases the PostScript encoding is empty and the TTF’s
built-in encoding takes over.
Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which is a
string, such as “Euro”. An encoding is a map of 8-bit codes to names, for example
WinAnsiEncoding is the Type 1 version of the familiar Windows-1252 encoding. So
we’d have 128 => “Euro”, in that case.
Later on, when Unicode was created, Adobe provided the glyphlist.txt to map from the
standard glyph names to Unicode code points, e.g. “Euro” => U+20AC. Combined with a
Type 1 encoding, this lets us read a code in a PDF file and convert it to Unicode, e.g.
128 => “Euro” => U+20AC. This is a global mapping, so we don’t need one per font.
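To make the two steps concrete, here is a minimal sketch using ordinary Java maps. It is not how PDFBox implements the lookup; the class and field names are invented for illustration, and the tables hold only two entries each.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class GlyphLookupSketch {
    // Step 1 (per encoding): 8-bit code -> glyph name. Tiny excerpt of WinAnsiEncoding.
    static final Map<Integer, String> ENCODING = new HashMap<>();
    // Step 2 (global): glyph name -> Unicode text, in the spirit of glyphlist.txt.
    static final Map<String, String> GLYPH_LIST = new HashMap<>();

    static {
        ENCODING.put(128, "Euro");
        ENCODING.put(65, "A");
        GLYPH_LIST.put("Euro", "\u20AC");
        GLYPH_LIST.put("A", "A");
    }

    /** Returns the Unicode text for a code, or empty if either step has no entry. */
    static Optional<String> toUnicode(int code) {
        String name = ENCODING.get(code);
        if (name == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(GLYPH_LIST.get(name));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode(128).orElse("(no mapping)"));
        System.out.println(toUnicode(33).orElse("(no mapping)"));
    }
}
```

Only the first map depends on the font's encoding. The second is shared, which is why a glyph name that is not in it (such as "c85") produces a "No Unicode mapping" warning regardless of the font.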
Some fonts use non-standard names for glyphs, usually because the glyph is
unusual and no standard exists. PDF provides numerous mechanisms for such
glyphs to be mapped to Unicode and one of these is to look up the name in the
standard glyph list. PDFBox ships with an additional, non-standard glyph list
which covers some commonly encountered glyphs such as those found in TeX. This
is a bit of a hack, but such fonts typically don’t use any of the other Unicode
mechanisms provided by PDF, so this is a last resort for mapping such glyphs to
Unicode.
So to be specific, my questions are :
Is there any way to give PDFBox a map *per font*?
A glyph’s name should uniquely identify that glyph, so this shouldn’t be
necessary. Just add the missing names to additional.txt.
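For reference, a small sketch of reading entries in the Adobe glyph list layout, where each line is "name;HEX" with one or more space-separated hexadecimal code points and "#" starts a comment. It assumes additional.txt follows the same layout as glyphlist.txt; the parser class is hypothetical and the "c85" entry in the usage is invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GlyphListParserSketch {
    /** Parses "name;HEX[ HEX...]" lines into a glyph name -> Unicode text map. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> nameToUnicode = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue; // not a "name;codes" entry
            }
            StringBuilder text = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                text.appendCodePoint(Integer.parseInt(hex, 16));
            }
            nameToUnicode.put(parts[0].trim(), text.toString());
        }
        return nameToUnicode;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# standard-style entry",
                "Euro;20AC",
                "# invented entry for a non-standard name",
                "c85;2661");
        System.out.println(parse(lines).keySet());
    }
}
```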
What is the philosophy of glyph conversion? How are different fonts converted
to different Unicode characters?
Hopefully I’ve covered that above. The overall philosophy is to avoid
hard-coding where possible and infer Unicode from the PDF wherever possible.
Please, let me know if I am looking at the whole thing incorrectly. Perhaps
there is an easier way…
If you upload the PDF to a public URL then I can take a look at it and see
exactly what the issue is.
Thank you,
Zeev
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org