All,
How much effort would it be to track/calculate the ratio of characters with
missing Unicode mappings to those with mappings for a given page? It would be
neat, after trying to extract text from a page, to be able to tell how many
characters were lost. We could use this info in Tika to decide whether to run
OCR on a given page.
I see that there's currently a Set<String> tracking which characters have a
missing Unicode mapping, used to limit duplicate logging. If we changed that to
a Map<String,Integer>, we could track the number of occurrences as well.
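To make the idea concrete, here's a minimal sketch of what that counter could look like. All names here (UnicodeMappingStats, record, missingRatio) are hypothetical, not existing PDFBox API; the point is just the Set-to-Map change and the resulting ratio:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count occurrences of glyphs with no Unicode mapping,
// replacing the Set<String> currently used only to de-duplicate log messages.
public class UnicodeMappingStats {
    // glyph name (or code) -> number of times it appeared without a mapping
    private final Map<String, Integer> missing = new HashMap<>();
    private int mapped = 0;

    // Call once per glyph processed; 'unicode' is null when no mapping exists
    public void record(String glyphName, String unicode) {
        if (unicode == null) {
            missing.merge(glyphName, 1, Integer::sum);
        } else {
            mapped++;
        }
    }

    // Ratio of missing-mapping occurrences to total characters on the page
    public double missingRatio() {
        int missingCount = missing.values().stream()
                .mapToInt(Integer::intValue).sum();
        int total = missingCount + mapped;
        return total == 0 ? 0.0 : (double) missingCount / total;
    }
}
```

A consumer like Tika could then compare missingRatio() against some threshold to decide whether to fall back to OCR for that page.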
Is there an easy enough way to get the fonts after processing a page and then
pull this info from them? Are we doing any static caching of fonts that would
prevent accurate per-page counts?
Thank you.
Best,
Tim