All,
How much effort would it be to track/calculate the ratio of characters with
missing Unicode mappings to those with mappings for a given page? It would be
neat, after trying to extract text from a page, to be able to tell how many
characters were lost. We could use this info in Tika to decide whether to run
OCR on a given page.
I see that there's currently a Set<String> tracking which characters have a
missing Unicode mapping, used to limit duplicate logging. If we changed that to
a Map<String,Integer>, we could track the number of occurrences as well.
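To make the idea concrete, here's a minimal sketch of what that counter could look like. All names here (UnicodeMappingStats, record, missingRatio) are hypothetical, not existing PDFBox API; the point is just the Set-to-Map change and the resulting ratio:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count occurrences of glyphs with no Unicode mapping,
// replacing the Set<String> currently used only to de-duplicate log messages.
public class UnicodeMappingStats {
    // glyph name (or code) -> number of times it appeared without a mapping
    private final Map<String, Integer> missing = new HashMap<>();
    private int mapped = 0;

    // Call once per glyph processed; 'unicode' is null when no mapping exists
    public void record(String glyphName, String unicode) {
        if (unicode == null) {
            missing.merge(glyphName, 1, Integer::sum);
        } else {
            mapped++;
        }
    }

    // Ratio of missing-mapping occurrences to total characters on the page
    public double missingRatio() {
        int missingCount = missing.values().stream()
                .mapToInt(Integer::intValue).sum();
        int total = missingCount + mapped;
        return total == 0 ? 0.0 : (double) missingCount / total;
    }
}
```

A consumer like Tika could then compare missingRatio() against some threshold to decide whether to fall back to OCR for that page.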
Is there an easy enough way to get the fonts after processing a page and then
pull this info from them? Are we doing any static caching of fonts that would
prevent accurate per-page counts?
Thank you.
Best,
Tim