pdftohtml produces background images whose (x,y) positions are specified
in the page's markup. It creates images for the underlines of text
and also for blocked sections (with visible frames), foreign-language
text, ...

 Programmatically scanning those background images to find lines
and boxes is easy, but how could you detect text (other than by
exclusion) and the language of that text?
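
 To make the "easy" part concrete, here is a minimal sketch of what I mean
by scanning for lines. It assumes the background image has already been
decoded into rows of 0/1 pixels (1 = dark); that decoding step, the function
name and the min_length threshold are all my own illustrative choices, not
anything pdftohtml provides. It only finds horizontal runs such as
underlines or box edges and says nothing about text or language.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # hypothetical representation: 1 = dark pixel, 0 = background


def find_horizontal_lines(bitmap: Bitmap, min_length: int = 5) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, end_col) for every horizontal run of dark
    pixels that is at least min_length pixels wide."""
    runs = []
    for y, row in enumerate(bitmap):
        start = None
        # The trailing 0 is a sentinel so a run touching the right edge is closed.
        for x, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = x                      # a run of dark pixels begins
            elif not pixel and start is not None:
                if x - start >= min_length:    # long enough to count as a line
                    runs.append((y, start, x - 1))
                start = None
    return runs
```

 Short runs (single dark pixels, thin glyph strokes) fall below the threshold
and are ignored, which is the "by exclusion" approach: whatever is not a long
line or a box edge might be text, but this tells you nothing more than that.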

 I asked basically the same question on the gimpusers forum:

https://www.gimpusers.com/forums/gimp-user/21659-approches-used-for-language-detection-on-images

 They told me that OCR folks would know best.

 lbrtchx
 [email protected]:approches used for language detection
on images ...
