> On 29 Apr 2015, at 16:34, Zeev Sands <zeev.sa...@gmail.com> wrote:
>
> Hello,
>
> I knew something was fishy about my theory ;) Thank you for taking the time
> to explain.
>
> I didn't realize each font's glyph had a unique name - the symbolic fonts I'm
> working with are coming in with nondescript names, like, "c85" for a glyph
> with code 85. I simply assumed it was just the code converted to string, not
> a "name", my bad. Some fonts are coming in with glyph names like ".notdef".
> So I'm getting a "No Unicode mapping for .notdef (10) in font Times-Roman"
> warning etc.

“.notdef” is special, it means the name is missing. So PDFBox wasn’t able to
map code 10 to any name. Either that’s a missing glyph, or the font was
embedded as a TTF with raw GIDs and no encoding information.

> I guess, for "c85" I'll have to make sure this "name" is unique across
> different fonts I'm working with and then add it to the non-standard glyph
> list. Or, eat up the warning and do some post processing.

That’s an unfortunate name, because it defeats the point of having names and
codes be separate. I can’t say I’m surprised to see this. If necessary you can
subclass PDFTextStripper (or, in fact, any subclass of PDFStreamEngine) and add
your own per-font glyph mapping by overriding showGlyph.

> What, I think, would be helpful in cases like this is to know if a glyph was
> converted or it is the original code point. I will play with it for a bit and
> post a snippet if I come up with anything useful.

I’m not quite sure what you mean. Do you mean the last-resort case where, if
we fail to find a mapping, we simply coerce the PDF character code to Unicode?
(Yes, it’s a bad idea, but it’s what Acrobat does.) If so, that code can be
found in PDFTextStreamEngine. You can override showGlyph to hook into this, by
checking if unicode == null before calling super.showGlyph.

-- John

> Thank you I can see the logic now,
> Zeev
>
> On 04/29/2015 01:32 AM, John Hewson wrote:
>> Hi Zeev,
>>
>>> On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sa...@gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> I've been using pdfbox 2.0 for a couple of weeks and came across an issue
>>> with some symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
>>> I needed to convert the symbols to their unicode equivalents, so I cooked
>>> up a small class to do that. No problems there.
>> It’s not clear exactly what you’re trying to do. Are you talking about
>> extracting text from a PDF? I’m going to assume that you are.
>>
>>> My issue is - some of the symbols coming in are already being converted and
>>> some are not. I do see that there is a list of glyphs that is being loaded
>>> to do just that (glyphlist.txt) and there is an additional list
>>> (additional.txt) for more glyphs. What I don't understand is how a glyph
>>> can be mapped without specifying a font name, for example in
>>> WPIconicSymbolsA dec 33 is an outline of a heart, in WPTypographicSymbols
>>> dec 33 is a large filled dot.
>> PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid
>> onto it, so even though the font may be a TTF, there’s another layer of
>> encoding. In some cases the original font’s encoding is stripped, so this is
>> the only encoding; in other cases the PostScript encoding is empty and the
>> TTF’s built-in encoding takes over.
>>
>> Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which
>> is a string, such as “Euro”. An encoding is a map of 8-bit codes to names,
>> for example WinAnsiEncoding is the Type 1 version of the familiar
>> Windows-1252 encoding. So we’d have 128 => “Euro”, in that case.
>>
>> Later on, when Unicode was created, Adobe provided the glyphlist.txt to map
>> from the standard glyph names to Unicode code points, e.g. “Euro” => U+20AC.
>> Combined with a Type 1 encoding, this lets us read a code in a PDF file and
>> convert it to Unicode, e.g. 128 => “Euro” => U+20AC. This is a global
>> mapping, so we don’t need one per font.
>>
>> Some fonts use non-standard names for glyphs, usually because the glyph is
>> unusual and no standard exists. PDF provides numerous mechanisms for such
>> glyphs to be mapped to Unicode and one of these is to look up the name in
>> the standard glyph list. PDFBox ships with an additional, non-standard glyph
>> list which covers some commonly encountered glyphs such as those found in
>> TeX.
>> This is a bit of a hack, but such fonts typically don’t use any of the other
>> Unicode mechanisms provided by PDF, so this is a last resort for mapping
>> such glyphs to Unicode.
>>
>>> So to be specific, my questions are:
>>>
>>> Is there any way to give pdf box a map *per font*?
>> A glyph’s name should uniquely identify that glyph, so this shouldn’t be
>> necessary. Just add the missing names to additional.txt.
>>
>>> What is the philosophy of glyph conversion: how are different fonts
>>> converted to different unicode characters?
>> Hopefully I’ve covered that above. The overall philosophy is to avoid
>> hard-coding where possible and infer Unicode from the PDF wherever possible.
>>
>>> Please, let me know if I am looking at the whole thing incorrectly. Perhaps
>>> there is an easier way…
>> If you upload the PDF to a public URL then I can take a look at it and see
>> exactly what the issue is.
>>
>>> Thank you,
>>> Zeev
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfbox.apache.org
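
The lookup chain discussed in this thread (character code, then glyph name,
then Unicode) can be sketched in plain Java. This is only a model of the idea,
not PDFBox code: the class GlyphLookupSketch, its maps and its method names are
all hypothetical, the glyph name "c33" is assumed by analogy with the "c85"
mentioned in the thread, and the Unicode values picked for the two WP symbol
fonts are illustrative guesses. The per-font table is consulted only when the
global glyph list has no entry, which roughly mirrors the suggestion to check
for unicode == null.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the lookup chain from the thread. Not PDFBox code:
// every class, field and method name here is hypothetical.
class GlyphLookupSketch {

    // Tiny fragment of a Type 1 style encoding: 8-bit code -> glyph name.
    static final Map<Integer, String> WIN_ANSI_FRAGMENT = new HashMap<>();

    // Assumed encoding for the symbol fonts, with names like "c85" for code 85.
    static final Map<Integer, String> SYMBOL_ENCODING = new HashMap<>();

    // Tiny fragment of a global glyph list: glyph name -> Unicode string.
    static final Map<String, String> GLYPH_LIST = new HashMap<>();

    // Per-font fallback: font name -> (glyph name -> Unicode string).
    static final Map<String, Map<String, String>> PER_FONT = new HashMap<>();

    static {
        WIN_ANSI_FRAGMENT.put(128, "Euro");
        GLYPH_LIST.put("Euro", "\u20AC");

        SYMBOL_ENCODING.put(33, "c33");
        SYMBOL_ENCODING.put(85, "c85");

        // Illustrative guesses for the two meanings of code 33 in the thread.
        Map<String, String> iconic = new HashMap<>();
        iconic.put("c33", "\u2661"); // outline of a heart
        PER_FONT.put("WPIconicSymbolsA", iconic);

        Map<String, String> typographic = new HashMap<>();
        typographic.put("c33", "\u25CF"); // large filled dot
        PER_FONT.put("WPTypographicSymbols", typographic);
    }

    // Returns the Unicode string for a code, or null if nothing maps it.
    static String toUnicode(String fontName, Map<Integer, String> encoding, int code) {
        String name = encoding.get(code);
        if (name == null || ".notdef".equals(name)) {
            return null; // no usable glyph name for this code
        }
        String unicode = GLYPH_LIST.get(name); // global, font-independent mapping
        if (unicode == null) {
            Map<String, String> perFont = PER_FONT.get(fontName);
            if (perFont != null) {
                unicode = perFont.get(name); // per-font fallback
            }
        }
        return unicode;
    }

    // Last resort mentioned in the thread: treat the code itself as a code point.
    static String toUnicodeOrCoerce(String fontName, Map<Integer, String> encoding, int code) {
        String unicode = toUnicode(fontName, encoding, code);
        return unicode != null ? unicode : new String(Character.toChars(code));
    }

    public static void main(String[] args) {
        String euro = toUnicode("Times-Roman", WIN_ANSI_FRAGMENT, 128);
        String heart = toUnicode("WPIconicSymbolsA", SYMBOL_ENCODING, 33);
        String dot = toUnicode("WPTypographicSymbols", SYMBOL_ENCODING, 33);
        System.out.println(String.format("U+%04X", euro.codePointAt(0)));
        System.out.println(String.format("U+%04X", heart.codePointAt(0)));
        System.out.println(String.format("U+%04X", dot.codePointAt(0)));
        // Code 10 has no name in the fragment, so only the coercing variant maps it.
        System.out.println(toUnicode("Times-Roman", WIN_ANSI_FRAGMENT, 10) == null);
    }
}
```

Keeping toUnicode and toUnicodeOrCoerce separate lets a caller tell a real
mapping from a coerced one, which is the distinction asked about in the thread.
In PDFBox itself the corresponding hook would be the showGlyph override
mentioned above; its exact signature depends on the PDFBox version in use.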