Re: [tesseract-ocr] Detecting language automatically

Charles Cho Thu, 25 Mar 2021 01:41:47 -0700

Hi,

I have looked into detecting the language automatically.
I referred to these links. Thank you, Merlijn.
https://archive.org/services/docs/api/ocr.html#autonomous-mode
https://git.archive.org/www/tesseract/-/blob/master/main.py#L757


From my analysis, it uses Tesseract's OSD (orientation and script 
detection) to detect the layout and the script.
After detecting the script, it detects the languages written in that script.
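
To make the first step concrete, here is a minimal Python sketch of how I understand the script detection result could be read. It assumes the plain-text "Key: value" report that the tesseract command line prints in OSD mode. The function names, the sample text, the confidence threshold and the Latin fallback are my own assumptions for illustration, not taken from the Internet Archive code.

```python
def parse_osd(osd_text):
    """Parse 'Key: value' lines from an OSD report into a dict."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def detected_script(osd_text, min_confidence=1.0, fallback="Latin"):
    """Return the reported script, or a fallback if it is missing or weak.

    min_confidence and fallback are hypothetical defaults chosen for
    this sketch only.
    """
    fields = parse_osd(osd_text)
    script = fields.get("Script")
    try:
        confidence = float(fields.get("Script confidence", "0"))
    except ValueError:
        confidence = 0.0
    if script is None or confidence < min_confidence:
        return fallback
    return script


# Hand-written sample in the shape of an OSD report (not real output).
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.33
Script: Cyrillic
Script confidence: 2.44
"""

if __name__ == "__main__":
    print(detected_script(SAMPLE))
```

The detected script would then decide which model to OCR the page with, and the language guess would be made on the OCR text afterwards.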

So I tried to use OSD mode in textfairy, an Android OCR app based on 
Tesseract 4.1.1.
But it doesn't work, and I'm not sure how to use OSD mode on Android.
I set 'osd' as the language option string, used osd.traineddata, and set 
'OEM_OSD_ONLY' as the engine mode, but it still doesn't work.
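
For comparison, on the desktop command line OSD is selected through the page segmentation mode (--psm 0), not through an engine mode, so the Android setting I need may be a page segmentation mode as well. I have not confirmed this for the Android wrapper. Below is a small Python sketch that only builds that command line; the function name and the example paths are hypothetical.

```python
def build_osd_command(image_path, tessdata_dir=None):
    """Build a tesseract command line that runs OSD only (--psm 0).

    The report is written to stdout. tessdata_dir is optional and
    should point at the folder that contains osd.traineddata.
    """
    cmd = ["tesseract", image_path, "stdout", "--psm", "0"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders for this sketch.
    print(" ".join(build_osd_command("page.png", "/path/to/tessdata")))
```

The resulting list can be passed to subprocess.run on a machine where tesseract and osd.traineddata are installed.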

I hope someone can help me use OSD mode on Android.

Thank you.
Best,
Charles.

On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:

> Hi, Merlijn.
>
> Thanks for your kind response.
>
> Regarding autonomous mode, I'm trying to find such a module for Android.
> But I have found nothing so far. I will keep looking.
>
> >I am not sure what you're finding on google play store, but I have found
> >there to be no limitation to the amount of languages that can be used
> >during OCR. Keep in mind that using more languages will slow down the
> >OCR process.
> It's textfairy, open source app.
> https://play.google.com/store/apps/details?id=com.renard.ocr
>
> Your response is really helpful.
>
> Best,
> Charles.
> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>
>> Hi, 
>>
>> On 19/03/2021 10:11, Charles Cho wrote: 
>> > Hello, 
>> > I'm working on an OCR Android app based on Tesseract. 
>> > I want to add a feature that detects the language automatically and 
>> > recognizes at least 2 languages at once. 
>> > I have investigated this for a while, so I know that I have to 
>> > specify the language for Tesseract. 
>> > So how can I implement automatic language detection? 
>>
>> Not exactly a mobile use case, but you can read how the Internet Archive 
>> does this (I coined it "autonomous mode", where the software just 
>> figures out the scripts and languages): 
>>
>> https://archive.org/services/docs/api/ocr.html#autonomous-mode 
>>
>> And the code is available, here (I plan to split out the archive.org 
>> specific code from the python code that invokes Tesseract and performs 
>> heuristics like script detection): 
>>
>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 
>>
>> the tl;dr is to first perform script detection, and use the detected 
>> script to OCR the page - then use language detection libraries to guess 
>> the languages on the page. 
>>
>> > And tesseract on google play store can recognize 3 languages at once. 
>> > Is it maximum? 
>>
>> I am not sure what you're finding on google play store, but I have found 
>> there to be no limitation to the amount of languages that can be used 
>> during OCR. Keep in mind that using more languages will slow down the 
>> OCR process. 
>>
>> > Any help and advice would be really appreciated. 
>>
>> Hope this helps. 
>>
>> Cheers, 
>> Merlijn 
>>
>
