Re: [tesseract-ocr] approches used for language detection on images ...

Lorenzo Bolzani Sat, 01 Feb 2020 07:22:38 -0800

You can try some machine learning based text detection, like this one for
example:


https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
https://github.com/argman/EAST

It's not so easy to use because, as you can see in the images, you are
going to get multiple boxes for what is really one block of text. So you
need a threshold-based aggregation step that merges boxes lying close to
each other into the real text blocks.
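
A minimal sketch of such an aggregation step, assuming the detector gives
you axis-aligned boxes as (x1, y1, x2, y2) tuples. The names merge_boxes,
is_close and the gap parameter are mine, not part of EAST or OpenCV, and the
right gap value will depend on your images:

```python
def is_close(a, b, gap):
    """True if boxes a and b overlap once each is grown by `gap` pixels."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_boxes(boxes, gap=10):
    """Repeatedly merge boxes closer than `gap` until nothing changes."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if is_close(box, other, gap):
                    # Replace the stored box with the union of the two.
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

With a gap of 10, two word boxes 5 pixels apart on the same line end up as
one block, while a box far away on the page is left alone.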

If your text is simple and uniform, like text rendered from PDF or HTML,
something like this may work too:

https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
https://github.com/qzane/text-detection/blob/master/TextDetect.py


As for the language of the text, a brute-force approach could be to run
tesseract with different languages and see which one gives you the highest
confidence. Other than that, you might try a simple "character detection"
with a few key characters for each language and see where you get the best
score (for example with OpenCV template matching), but I would expect a lot
of errors if the text uses different fonts and sizes.
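
A rough sketch of the brute-force idea, assuming you use the pytesseract
wrapper and have the traineddata for each candidate language installed. The
helper names mean_confidence and rank_languages are mine; averaging the
per-word confidences is just one possible way to score a run:

```python
def mean_confidence(confs):
    """Average word confidences, skipping the -1 entries that tesseract
    reports for boxes that are not words (blocks, paragraphs, lines)."""
    values = [float(c) for c in confs if float(c) >= 0]
    return sum(values) / len(values) if values else 0.0


def rank_languages(image, langs):
    """Run tesseract once per language; return (lang, score), best first."""
    # Imported here so the scoring helper works without pytesseract installed.
    import pytesseract
    from pytesseract import Output

    scores = {}
    for lang in langs:
        data = pytesseract.image_to_data(image, lang=lang,
                                         output_type=Output.DICT)
        scores[lang] = mean_confidence(data["conf"])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

You would call it as rank_languages(img, ["eng", "ita", "deu"]) with a PIL
image or a file path. It runs a full OCR pass per language, so it gets slow
with many candidates.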

If all the languages use the same alphabet, Latin for example, you can run
the OCR with a generic language ("eng"), do a character distribution
analysis on the output to find the original language, and then process the
image again with the correct tesseract language:

https://data-science-blog.com/blog/2018/11/12/language-detecting-with-sklearn-by-determining-letter-frequencies/
https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/
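
A toy sketch of the character distribution idea, in the spirit of the two
links above but not taken from them. It compares character bigram counts
with cosine similarity. The function names are mine, and the reference
samples you pass in are an assumption: in practice you would build the
profiles from a much larger corpus per language.

```python
from collections import Counter
import math


def profile(text, n=2):
    """Count character n-grams, keeping only letters and spaces."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_language(text, samples):
    """Return the key of the sample whose profile is closest to `text`.

    `samples` maps a tesseract language code to reference text.
    """
    target = profile(text)
    return max(samples, key=lambda lang: cosine(target, profile(samples[lang])))
```

You would feed it the text from the first "eng" pass and then rerun
tesseract with the language code it returns. Accented characters tend to get
mangled by the wrong language model, so expect the guess to be noisy on
short snippets.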

Finally, for different alphabets, you could also train a very simple neural
network to do the classification (google "MNIST CNN"); the most complex
part would be preparing the dataset.


Lorenzo






On Sat, 1 Feb 2020 at 12:26, Albretch Mueller <[email protected]>
wrote:

>  pdftohtml produces background images whose (x,y) positions are specified
> in the page's markup. It creates images for the underlines of text
> and also for blocked sections (with visible frames), foreign-language
> text, ...
>
>  programmatically scanning those background images to find out lines
> and boxes is easy, but how could you detect text (other than by
> exclusion) and the language of that text?
>
>  I asked basically the same question on the gimpusers forum:
>
>
> https://www.gimpusers.com/forums/gimp-user/21659-approches-used-for-language-detection-on-images
>
>  they told me OCR kinds of folks should know best:
>
>  lbrtchx
>  [email protected]:approches used for language detection
> on images ...
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CAFakBwhJN1JWMHg-h3nsS8t0FEpP%2BkGZXUjsvJOy%2BKb2w_f0JQ%40mail.gmail.com
> .
>

