> On 29 Apr 2015, at 16:34, Zeev Sands <zeev.sa...@gmail.com> wrote:
> 
> Hello,
> 
> I knew something was fishy about my theory ;) Thank you for taking the time 
> to explain.
> 
> I didn't realize each font's glyph had a unique name - the symbolic fonts I'm 
> working with are coming in with nondescript names, like, "c85" for a glyph 
> with code 85. I simply assumed it was just the code converted to string, not 
> a "name", my bad. Some fonts are coming in with glyph names like ".notdef". 
> So I'm getting a "No Unicode mapping for .notdef (10) in font Times-Roman" 
> warning etc.

“.notdef” is special: it means the name is missing. So PDFBox wasn’t able to 
map code 10 to any name. Either that’s a missing glyph, or the font was 
embedded as a TTF with raw GIDs and no encoding information.
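
To make the ".notdef" case concrete, here is a rough sketch of the two-step 
lookup (code to glyph name, then glyph name to Unicode). It is not PDFBox's 
actual code: the class name, the method names and the tiny tables are 
made up for illustration, and only the 128 => "Euro" => U+20AC entry comes 
from the WinAnsiEncoding example in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: models code -> glyph name -> Unicode. Not PDFBox's implementation. */
final class GlyphLookupSketch {
    // Hypothetical fragment of an encoding (character code -> glyph name).
    static final Map<Integer, String> ENCODING = new HashMap<>();
    // Hypothetical fragment of a glyph list (glyph name -> Unicode string).
    static final Map<String, String> GLYPH_LIST = new HashMap<>();

    static {
        ENCODING.put(128, "Euro");
        ENCODING.put(85, "c85");           // non-standard name, absent from the glyph list
        GLYPH_LIST.put("Euro", "\u20AC");  // EURO SIGN
    }

    /** Returns the glyph name for a code, or ".notdef" if the encoding has no entry. */
    static String nameFor(int code) {
        return ENCODING.getOrDefault(code, ".notdef");
    }

    /** Returns the Unicode string for a code, or null if no mapping is known. */
    static String unicodeFor(int code) {
        String name = nameFor(code);
        if (".notdef".equals(name)) {
            return null; // no name, so there is nothing to look up
        }
        return GLYPH_LIST.get(name); // null for unknown names such as "c85"
    }

    public static void main(String[] args) {
        for (int code : new int[] {128, 85, 10}) {
            System.out.println(code + " -> " + nameFor(code) + " -> " + unicodeFor(code));
        }
    }
}
```

In this sketch code 10 fails at the first step (no name), while code 85 fails 
at the second step (a name exists, but the glyph list does not know it). 
Those are the two situations behind the warnings you are seeing.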

> I guess, for "c85" I'll have to make sure this "name" is unique across 
> different fonts I'm working with and then add it to the non-standard glyph 
> list. Or, eat up the warning and do some post processing.

That’s an unfortunate name, because it defeats the point of having names and 
codes be separate. I can’t say I’m surprised to see this. If necessary you can 
subclass PDFTextStripper (or, in fact, any subclass of PDFStreamEngine) and add 
your own per-font glyph mapping by overriding showGlyph.
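
A per-font table could look roughly like the sketch below. It uses plain 
Java collections only, so it shows the data structure rather than the PDFBox 
hook itself; the idea is that an overridden showGlyph would call something 
like resolve() with the font's name and the glyph's name. The class and method 
names are hypothetical, the glyph name "c33" is assumed from the "c85" pattern 
you describe, and the target code points (U+2661 for the heart outline, U+25CF 
for the filled dot) are my guesses at suitable equivalents.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: Unicode overrides keyed by font name, then by glyph name. */
final class PerFontGlyphMap {
    private final Map<String, Map<String, String>> byFont = new HashMap<>();

    /** Registers a Unicode value for one glyph name in one font. */
    void put(String fontName, String glyphName, String unicode) {
        byFont.computeIfAbsent(fontName, k -> new HashMap<>()).put(glyphName, unicode);
    }

    /**
     * Returns the per-font override if one exists; otherwise returns the value
     * that was already resolved by other means (which may be null).
     */
    String resolve(String fontName, String glyphName, String alreadyResolved) {
        Map<String, String> table = byFont.get(fontName);
        if (table != null && table.containsKey(glyphName)) {
            return table.get(glyphName);
        }
        return alreadyResolved;
    }

    /** Builds a small example table; names and code points are assumptions. */
    static PerFontGlyphMap example() {
        PerFontGlyphMap map = new PerFontGlyphMap();
        map.put("WPIconicSymbolsA", "c33", "\u2661");     // heart outline (assumed target)
        map.put("WPTypographicSymbols", "c33", "\u25CF"); // large filled dot (assumed target)
        return map;
    }

    public static void main(String[] args) {
        PerFontGlyphMap map = example();
        System.out.println(map.resolve("WPIconicSymbolsA", "c33", null));
        System.out.println(map.resolve("WPTypographicSymbols", "c33", null));
    }
}
```

Here the override wins over whatever was resolved already, which is a design 
choice rather than a requirement. Note also that embedded subset fonts often 
carry a six-letter prefix in their name (for example "ABCDEF+FontName"), so 
the key may need normalising before the lookup.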

> What, I think, would be helpful in cases like this is to know if a glyph was 
> converted or it is the original code point. I will play with it for a bit and 
> post a snippet if I come up with anything useful.

I’m not quite sure what you mean. Do you mean the last-resort case where, if we 
fail to find a mapping, we simply coerce the PDF character code to Unicode? 
(Yes, it’s a bad idea, but it’s what Acrobat does.) If so, that code can be 
found in PDFTextStreamEngine. You can override showGlyph to hook into this by 
checking if unicode == null before calling super.showGlyph.
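
The check itself is simple. The sketch below is standalone Java rather than a 
real showGlyph override, and the class, the Resolved holder and the use of 
Character.toChars are my own illustration, not the library's code. It only 
shows the idea of remembering whether the text came from a real mapping or 
from the last-resort coercion of the character code.

```java
/** Illustrative only: records whether text came from a mapping or from coercing the code. */
final class FallbackSketch {

    /** Result of resolving one character code. */
    static final class Resolved {
        final String text;
        final boolean coerced; // true if the code itself was used as the Unicode value

        Resolved(String text, boolean coerced) {
            this.text = text;
            this.coerced = coerced;
        }
    }

    /**
     * If a Unicode mapping was found, keeps it. Otherwise treats the character
     * code as a Unicode code point and flags the result as coerced.
     */
    static Resolved resolve(int code, String unicode) {
        if (unicode != null) {
            return new Resolved(unicode, false);
        }
        // Last resort: valid only while the code is a legal code point,
        // which holds for the 8-bit codes of simple fonts.
        return new Resolved(new String(Character.toChars(code)), true);
    }

    public static void main(String[] args) {
        Resolved mapped = resolve(128, "\u20AC");
        Resolved guessed = resolve(85, null);
        System.out.println(mapped.text + " coerced=" + mapped.coerced);
        System.out.println(guessed.text + " coerced=" + guessed.coerced);
    }
}
```

With a flag like this you could collect the coerced characters per font and 
post-process only those, leaving properly mapped text alone.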

— John

> Thank you I can see the logic now,
> Zeev
> 
> On 04/29/2015 01:32 AM, John Hewson wrote:
>> Hi Zeev,
>> 
>>> On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:
>>> 
>>> Hello everyone,
>>> 
>>> I've been using pdfbox 2.0 for a couple of weeks and came across an issue 
>>> with some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
>>> I needed to convert the symbols to their unicode equivalents, so I cooked 
>>> up a small class to do that. No problems there.
>> It’s not clear exactly what you’re trying to do. Are you talking about 
>> extracting text from a PDF? I’m going to assume that you are.
>> 
>>> My issue is - some of the symbols coming in are already being converted and 
>>> some are not. I do see that there is a list of glyphs that is being loaded 
>>> to do just that (glyphlist.txt) and there is an additional list 
>>> (additional.txt) for more glyphs. What I don't understand is how a glyph 
>>> can be mapped without specifying a font name, for example in 
>>> WPIconicSymbolsA dec 33 is an outline of a heart, in WPTypographicSymbols 
>>> dec 33 is a large filled dot.
>> PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid 
>> onto it, so even though the font may be a TTF, there’s another layer of 
>> encoding. In some cases the original font’s encoding is stripped, so this is 
>> the only encoding, in other cases the PostScript encoding is empty and the 
>> TTF’s built-in encoding takes over.
>> 
>> Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which 
>> is a string, such as “Euro”. An encoding is a map of 8-bit codes to names, 
>> for example WinAnsiEncoding is the Type 1 version of the familiar 
>> Windows-1252 encoding. So we’d have 128 => “Euro”, in that case.
>> 
>> Later on, when Unicode was created, Adobe provided the glyphlist.txt to map 
>> from the standard glyph names to Unicode code points, e.g. “Euro” => U+20AC. 
>> Combined with a Type 1 encoding, this lets us read a code in a PDF file and 
>> convert it to Unicode, e.g. 128 => “Euro” => U+20AC. This is a global 
>> mapping, so we don’t need one per font.
>> 
>> Some fonts use non-standard names for glyphs, usually because the glyph is 
>> unusual and no standard exists. PDF provides numerous mechanisms for such 
>> glyphs to be mapped to Unicode and one of these is to look up the name in 
>> the standard glyph list. PDFBox ships with an additional, non-standard glyph 
>> list which covers some commonly encountered glyphs such as those found in 
>> TeX. This is a bit of a hack, but such fonts typically don’t use any of the other 
>> Unicode mechanisms provided by PDF, so this is a last resort for mapping 
>> such glyphs to Unicode.
>> 
>>> So to be specific, my questions are :
>>> 
>>>  Is there any way to give pdf box a map *per font*?
>> A glyph’s name should uniquely identify that glyph, so this shouldn’t be 
>> necessary. Just add the missing names to additional.txt.
>> 
>>>  What is the philosophy of glyph conversion? How are different fonts 
>>> converted to different unicode characters?
>> Hopefully I’ve covered that above. The overall philosophy is to avoid 
>> hard-coding where possible and infer Unicode from the PDF wherever possible.
>> 
>>> Please, let me know if I am looking at the whole thing incorrectly. Perhaps 
>>> there is an easier way…
>> If you upload the PDF to a public URL then I can take a look at it and see 
>> exactly what the issue is.
>> 
>>> Thank you,
>>> Zeev
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>> 