RE: tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
Perfect. Thank you. I'll open an issue and draft a patch.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, September 21, 2017 4:21 PM
To: users@pdfbox.apache.org
Subject: Re: tracking missing Unicode mappings?

The standard 14 fonts are cached

Re: tracking missing Unicode mappings?

2017-09-21 Thread Tilman Hausherr
The standard 14 fonts are cached, but these shouldn't cause any text extraction trouble. So all that would be needed is a map as described for the PDFont type. Now how to access the fonts... if you grab the TextPosition objects in an extension of PDFTextStripper (e.g. in the showGlyph method) you
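
To make the "map per font" idea concrete, here is a minimal stdlib-only sketch of such a tally. It is not PDFBox code: the class name `PerFontMappingTally` and its methods are hypothetical, and the type parameter `F` merely stands in for whatever identifies a font (a PDFont instance, or its name). How a missing mapping is detected inside a PDFTextStripper extension is left to the caller and is an assumption here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
// F is whatever identifies a font, e.g. a PDFont object or a font name.
final class PerFontMappingTally<F> {
    // counts[0] = glyphs with a Unicode mapping, counts[1] = glyphs without one
    private final Map<F, long[]> counts = new LinkedHashMap<>();

    // Record one glyph drawn with the given font.
    void record(F font, boolean hasUnicode) {
        long[] c = counts.computeIfAbsent(font, k -> new long[2]);
        c[hasUnicode ? 0 : 1]++;
    }

    // Number of glyphs of this font that had a mapping (0 for unseen fonts).
    long mapped(F font) {
        long[] c = counts.get(font);
        return c == null ? 0 : c[0];
    }

    // Number of glyphs of this font that had no mapping (0 for unseen fonts).
    long missing(F font) {
        long[] c = counts.get(font);
        return c == null ? 0 : c[1];
    }
}
```

A caller would invoke `record(font, hasUnicode)` once per glyph and read the per-font counts after the page has been processed.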

tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
All, How much effort would it be to track/calculate a ratio of characters with missing Unicode mappings to those with mappings for a given page? It would be neat after trying to extract text from a page to be able to tell how many characters are lost. We could use this info on Tika to
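
For illustration, the per-page ratio asked about above could be computed along these lines. This is only a sketch under assumptions: `UnicodeMappingStats` is a hypothetical class, not something PDFBox or Tika provides, and it assumes the caller passes `null` (or an empty string) for each character that has no Unicode mapping.

```java
// Hypothetical helper, not part of PDFBox or Tika.
final class UnicodeMappingStats {
    private long mapped;
    private long missing;

    // Record one character; null or "" is assumed to mean "no Unicode mapping".
    void record(String unicode) {
        if (unicode == null || unicode.isEmpty()) {
            missing++;
        } else {
            mapped++;
        }
    }

    long mapped() { return mapped; }

    long missing() { return missing; }

    // Ratio of unmapped to mapped characters.
    // NaN for an empty page, positive infinity if nothing was mapped.
    double missingToMappedRatio() {
        if (mapped == 0) {
            return missing == 0 ? Double.NaN : Double.POSITIVE_INFINITY;
        }
        return (double) missing / mapped;
    }
}
```

One instance per page keeps the numbers page-scoped; a fresh instance would be created at the start of each page.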