All,

How much effort would it be to track/calculate, for a given page, the ratio of characters with missing Unicode mappings to characters with mappings? It would be neat, after trying to extract text from a page, to be able to tell how many characters were lost. We could use this info in Tika to decide whether or not to run OCR on a given page.
I see that there’s currently a Set<String> for tracking which characters have a missing Unicode mapping, used to limit duplicate logging. If we changed that to a Map<String,Integer>, we could track occurrence counts. Is there an easy enough way to get the fonts after processing a page and then retrieve this info? Are we doing any static caching of fonts that would prevent accurate counts?

Thank you.

Best,
Tim
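To make the idea concrete, here is a minimal sketch of what the Set-to-Map change and the ratio calculation could look like. This is illustrative only: the class and method names (e.g. MissingUnicodeTracker, recordMissing) are hypothetical and not part of any existing API; the real change would live wherever the current Set<String> is maintained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count characters with and without a Unicode
// mapping for a page, then compute the ratio of unmapped characters.
// None of these names are real library API.
public class MissingUnicodeTracker {
    // character/glyph key -> number of times it appeared unmapped
    // (replaces the current Set<String> used to limit duplicate logging)
    private final Map<String, Integer> missing = new HashMap<>();
    private long mapped = 0;

    public void recordMapped() {
        mapped++;
    }

    public void recordMissing(String glyphKey) {
        // merge gives us the occurrence count the Set could not
        missing.merge(glyphKey, 1, Integer::sum);
    }

    public long missingCount() {
        long total = 0;
        for (int count : missing.values()) {
            total += count;
        }
        return total;
    }

    // Fraction of characters on the page that had no Unicode mapping;
    // 0.0 for an empty page. A high value could be an OCR trigger in Tika.
    public double missingRatio() {
        long total = mapped + missingCount();
        return total == 0 ? 0.0 : (double) missingCount() / total;
    }
}
```

Logging could still be deduplicated by checking whether merge returned 1 for a key, so the counting change would not reintroduce duplicate log lines.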