Re: [tesseract-ocr] Detecting language automatically

Merlijn B.W. Wajer Thu, 25 Mar 2021 11:33:25 -0700

Hi,

On 25/03/2021 19:04, Charles Cho wrote:
> Hi.
> 
> Thank you very much for your kind help, shree.
> I tried to detect script by your help and it worked. Great.
> 
> I have some questions.
> 1. If the image contains texts of different languages in a page, is there 
> any way to detect all of the languages? Now it detects only one language.
> 2. It detects English, German, French as 'Latin'. So how can I distinguish 
> the languages exactly?


The OSD module does not detect language - it detects script, as you also
noted earlier:

>>> So in my analysis, it used OSD of tesseract engine to detect layout and
>>> script.
>>> After detect script, it detects languages on the script.

What's missing is a second step: perform OCR using just the script model
(e.g. Latin) rather than a specific language, and then analyse the
recognised text to detect the language.

You could use something like this: https://github.com/saffsd/langid.c
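
A minimal sketch of that two-step idea, under some assumptions: the
tesseract binary is on PATH, the script models are installed so that
"-l script/Latin" resolves (on some installs the name is just "Latin"),
and the stopword-based guess_language() below is only a toy stand-in for a
real language identifier such as langid. The function names and the word
lists are made up for illustration.

```python
import re
import subprocess
from collections import Counter

# Tiny illustrative stopword lists (an assumption for this sketch);
# a real language identifier uses trained models instead.
STOPWORDS = {
    "eng": {"the", "and", "of", "to", "is", "in", "that", "it", "with", "for"},
    "deu": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "von", "zu"},
    "fra": {"le", "la", "les", "et", "est", "des", "une", "dans", "pas", "pour"},
}


def ocr_with_script(image_path, script_model="script/Latin"):
    """Run OCR with a script model only and return the recognised text.

    Assumes the tesseract CLI is installed and that script_model names an
    installed traineddata file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", script_model],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def guess_language(text):
    """Return the language code whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(words)
    scores = {
        lang: sum(counts[word] for word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Usage would be along the lines of
guess_language(ocr_with_script("page.png")), where "page.png" is a
placeholder file name. Once the language is known, the page can be
re-run with the matching language model if better accuracy is needed.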

Regards,
Merlijn
