All,

How much effort would it be to track/calculate, for a given page, the ratio of 
characters with missing Unicode mappings to those with mappings?  It would be 
neat, after trying to extract text from a page, to be able to tell how many 
characters were lost.  We could use this info in Tika to decide whether or 
not to run OCR on a given page.

I see that there’s currently a Set<String> tracking which characters have a 
missing Unicode mapping, used to limit duplicate logging.  If we changed that to 
a Map<String,Integer>, we could count the occurrences.
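The Set-to-Map change above could be sketched roughly as follows. This is just an illustration, not the actual PDFBox API: the class and method names (GlyphCoverage, onGlyph, missingRatio) are hypothetical, standing in for wherever the existing Set<String> lives in the text-extraction path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replace the Set<String> used for de-duplicated
// logging with a Map<String,Integer> that counts occurrences, so a
// missing-to-mapped ratio can be computed per page.
public class GlyphCoverage {
    private final Map<String, Integer> noUnicode = new HashMap<>();
    private int mapped = 0;

    // Called once per glyph; unicode == null means no mapping was found.
    public void onGlyph(String glyphId, String unicode) {
        if (unicode == null) {
            // merge() starts the count at 1 and increments on repeats,
            // preserving enough info to still log each glyph only once
            noUnicode.merge(glyphId, 1, Integer::sum);
        } else {
            mapped++;
        }
    }

    // Fraction of glyphs on the page that had no Unicode mapping.
    public double missingRatio() {
        int missing = noUnicode.values().stream().mapToInt(Integer::intValue).sum();
        int total = missing + mapped;
        return total == 0 ? 0.0 : (double) missing / total;
    }

    public static void main(String[] args) {
        GlyphCoverage c = new GlyphCoverage();
        c.onGlyph("g1", "a");
        c.onGlyph("g2", null);
        c.onGlyph("g2", null);  // duplicate missing glyph still counted
        c.onGlyph("g3", "b");
        System.out.println(c.missingRatio()); // 2 missing of 4 -> 0.5
    }
}
```

The ratio would then be reset (or the counters snapshotted) at page boundaries so each page gets its own score.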

Is there an easy way to get at the fonts after processing a page and pull 
this info from them?  Are we doing any static caching of fonts that would 
prevent accurate per-page counts?

Thank you.

         Best,

                  Tim
