Re: [tesseract-ocr] Detecting language automatically

shree Thu, 25 Mar 2021 06:49:45 -0700

See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc


// Get OSD - new code
    int orient_deg;
    float orient_conf;
    const char* script_name;
    float script_conf;
    api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name,
                                 &script_conf);
    printf("************\n Orientation in degrees: %d\n"
           " Orientation confidence: %.2f\n"
           " Script: %s\n Script confidence: %.2f\n",
           orient_deg, orient_conf,
           script_name, script_conf);
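
Once script_name is known, the next step described further down the thread is to OCR the page with a model for that script. A minimal sketch of that selection step follows. It is not taken from the linked example: the helper names, the script-to-model table, the "eng" fallback, and the confidence threshold are all assumptions for illustration, and the table covers only a few scripts. It assumes the script models (Latin.traineddata and so on) are installed under a script/ subdirectory of the tessdata directory.

```cpp
#include <map>
#include <string>
#include <vector>

// Assumed fallback when the script is unknown or the confidence is too low.
const char kFallbackLanguage[] = "eng";

// Hypothetical helper: maps a script name reported by OSD to the model(s)
// to load for recognition. Illustrative and incomplete; loading both Han
// models for "Han" is an assumption.
std::vector<std::string> ModelsForScript(const std::string& script) {
  static const std::map<std::string, std::vector<std::string>> kTable = {
      {"Latin", {"script/Latin"}},
      {"Cyrillic", {"script/Cyrillic"}},
      {"Arabic", {"script/Arabic"}},
      {"Han", {"script/HanS", "script/HanT"}},
      {"Hangul", {"script/Hangul"}},
      {"Japanese", {"script/Japanese"}},
  };
  auto it = kTable.find(script);
  if (it == kTable.end()) return {kFallbackLanguage};
  return it->second;
}

// Joins model names with '+', the separator Tesseract accepts for
// recognizing several languages in one pass.
std::string JoinLanguages(const std::vector<std::string>& models) {
  std::string out;
  for (const auto& model : models) {
    if (!out.empty()) out += '+';
    out += model;
  }
  return out;
}

// Hypothetical helper: builds the language string for the recognition pass
// from the OSD result. min_conf is a caller-chosen threshold, not a value
// recommended by Tesseract.
std::string LanguageStringForOsd(const char* script_name, float script_conf,
                                 float min_conf) {
  if (script_name == nullptr || script_conf < min_conf) {
    return kFallbackLanguage;
  }
  return JoinLanguages(ModelsForScript(script_name));
}
```

With the variables from the snippet above, LanguageStringForOsd(script_name, script_conf, 1.0f) would give the string to pass as the language argument of api->Init() for the second pass. The 1.0f threshold here is only a placeholder.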

On Thursday, March 25, 2021 at 2:11:42 PM UTC+5:30 charles...@gmail.com 
wrote:

> Hi,
>
> I have been investigating how to detect the language automatically.
> I referred to these links. Thank you, Merlijn.
> https://archive.org/services/docs/api/ocr.html#autonomous-mode
> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>
> From my analysis, it uses Tesseract's OSD to detect the layout and
> script.
> After detecting the script, it detects the languages written in that script.
>
> So I tried to use the OSD engine mode in textfairy, an Android OCR
> app based on Tesseract 4.1.1.
> But it doesn't work, and I'm not sure how to use the OSD engine mode on
> Android.
> I set 'osd' as the language option string, used osd.traineddata, and set
> 'OEM_OSD_ONLY' as the engine mode.
> But it doesn't work.
>
> I hope someone can help me use the OSD engine mode on Android.
>
> Thank you.
> Best,
> Charles.
>
> On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:
>
>> Hi, Merlijn.
>>
>> Thanks for your kind response.
>>
>> Regarding autonomous mode, I'm trying to find such a module for Android.
>> I have found nothing so far, but I will keep looking.
>>
>> >I am not sure what you're finding on google play store, but I have found
>> >there to be no limitation to the amount of languages that can be used
>> >during OCR. Keep in mind that using more languages will slow down the
>> >OCR process.
>> It's textfairy, an open-source app.
>> https://play.google.com/store/apps/details?id=com.renard.ocr
>>
>> Your response is really helpful.
>>
>> Best,
>> Charles.
>> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>>
>>> Hi, 
>>>
>>> On 19/03/2021 10:11, Charles Cho wrote: 
>>> > Hello, 
>>> > I'm working on an OCR Android app based on Tesseract.
>>> > I want to add a feature that detects the language automatically and
>>> > recognizes at least 2 languages at once.
>>> > I have investigated this for a while, so I know that I have to
>>> > specify the language for Tesseract.
>>> > How can I implement automatic language detection?
>>>
>>> Not exactly a mobile use case, but you can read how the Internet Archive 
>>> does this (I coined it "autonomous mode", where the software just 
>>> figures out the scripts and languages): 
>>>
>>> https://archive.org/services/docs/api/ocr.html#autonomous-mode 
>>>
>>> And the code is available, here (I plan to split out the archive.org 
>>> specific code from the python code that invokes Tesseract and performs 
>>> heuristics like script detection): 
>>>
>>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 
>>>
>>> the tl;dr is to first perform script detection, and use the detected 
>>> script to OCR the page - then use language detection libraries to guess 
>>> the languages on the page. 
>>>
>>> > And the Tesseract app on the Google Play Store can recognize 3 languages at once.
>>> > Is that the maximum?
>>>
>>> I am not sure what you're finding on google play store, but I have found 
>>> there to be no limitation to the amount of languages that can be used 
>>> during OCR. Keep in mind that using more languages will slow down the 
>>> OCR process. 
>>>
>>> > Any help and advice would be really appreciated. 
>>>
>>> Hope this helps. 
>>>
>>> Cheers, 
>>> Merlijn 
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/20bdef8f-a543-420d-aba8-a9260fe3a28bn%40googlegroups.com.
