pdftohtml produces background images whose (x, y) positions are specified in the page's markup. It creates images for text underlines and also for boxed sections (those with visible frames), foreign-language text, and so on.
Programmatically scanning those background images to find lines and boxes is easy, but how could you detect text (other than by exclusion) and the language of that text?

I asked basically the same question on the gimpusers forum:
https://www.gimpusers.com/forums/gimp-user/21659-approches-used-for-language-detection-on-images
They told me OCR folks should know best (original post by lbrtchx: "approches used for language detection on images").
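
To make the "easy" part concrete, here is a minimal sketch of how the line and box scan could look. It assumes each background image has already been loaded and thresholded into a list of rows of 0/1 pixels (1 = dark); the loading step, the names `horizontal_runs` and `classify`, and the 0.8 width ratio are all illustrative assumptions, not anything pdftohtml provides. Whatever falls into "other" is the text-by-exclusion case the question is about.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # assumed input: rows of 0/1 pixels, 1 = dark


def horizontal_runs(bitmap: Bitmap, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (row, x_start, x_end) for every run of dark pixels >= min_len."""
    runs = []
    for y, row in enumerate(bitmap):
        start = None
        # A trailing 0 acts as a sentinel so a run touching the right edge is closed.
        for x, px in enumerate(list(row) + [0]):
            if px and start is None:
                start = x
            elif not px and start is not None:
                if x - start >= min_len:
                    runs.append((y, start, x - 1))
                start = None
    return runs


def classify(bitmap: Bitmap, ratio: float = 0.8) -> str:
    """Guess 'line', 'box', or 'other' (hypothetical heuristic, not a standard)."""
    if not bitmap or not bitmap[0]:
        return "other"
    height, width = len(bitmap), len(bitmap[0])
    min_len = max(1, int(width * ratio))
    long_rows = {y for y, _, _ in horizontal_runs(bitmap, min_len)}
    # Every row is (almost) fully dark: an underline or a filled bar.
    if len(long_rows) == height:
        return "line"
    # Dark top and bottom edges plus dark left/right borders in between: a frame.
    if height >= 3 and 0 in long_rows and (height - 1) in long_rows:
        if all(row[0] and row[-1] for row in bitmap[1:-1]):
            return "box"
    # Anything else is a candidate for text and would need OCR to go further.
    return "other"
```

This heuristic cannot tell a thin underline from a solid filled rectangle, and it says nothing about the language of whatever lands in "other"; that part still needs an OCR engine.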

